CONCEPT DIABETES: Analytical pipeline

Author

Ignacio Oscoz Villanueva

Introduction

CONCEPT-DIABETES

CONCEPT-DIABETES is part of CONCEPT, a coordinated research project initiative under the umbrella of REDISSEC, the Spanish Research Network on Health Services Research and Chronic Conditions] [www.redissec.com], that aims at analyzing chronic care effectiveness and efficiency in a number of cohorts built on real world data (RWD). In the specific case of CONCEPT-DIABETES, the focus will be on assessing the effectiveness of a set of clinical practice actions and quality improvement strategies at different levels (patient-level, health-care provider level and health system level) on the management and health results of patients with type 2 diabetes (T2D) using process mining methodology.

It is a population-based retrospective observational study centered on all T2D patients diagnosed in four Regional Health Services within the Spanish National Health Service, that includes information from all their contacts with the health services using the electronic medical record systems including Primary Care data, Specialist Care data, Hospitalizations, Urgent Care data, Pharmacy Claims, and also other registers such as the mortality and the population register. We will assess to what extent recommended interventions from evidence-based guidelines are implemented in real life and which are their effects on health outcomes. Process mining methods will be used to analyze the data, and comparison with standard methods will be also conducted.

Cohort

The cohort is defined as patients with type 2 diabetes:

  • Inclusion criteria: Patients that, at 2017-01-01 or during the follow-up from 2017-01-01 to 2022-12-31, had active health card (active TIA - tarjeta sanitaria activa) and one of the inclusion codes given in the ‘inclusion code list (’T90’ if CIAP-2, ‘250’ if ‘CIE-9CM’ or ‘E11’ if CIE-10-ES).

  • Exclusion criteria: Patients that had none of the exclusion codes given in the exclusion codes list (‘T89’ if CIAP-2, ‘250.01’ if CIE-9CM, ‘250.03’ if CIE-9CM, ‘250.11’ if CIE-9CM, ‘250.13’ if CIE-9CM, ‘250.21’ if CIE-9CM, ‘250.23’ if CIE-9CM, ‘250.31’ if CIE-9CM, ‘250.33’ if CIE-9CM, ‘250.41’ if CIE-9CM, ‘250.43’ if CIE-9CM, ‘250.51’ if CIE-9CM, ‘250.53’ if CIE-9CM, ‘250.61’ if CIE-9CM, ‘250.63’ if CIE-9CM, ‘250.71’ if CIE-9CM, ‘250.73’ if CIE-9CM, ‘250.81’ if CIE-9CM, ‘250.83’ if CIE-9CM, ‘250.91’ if CIE-9CM, ‘250.93’ if CIE-9CM or ‘E10’ if CIE-10-ES) during their follow-up or patients with no contact with the health system from 2017-01-01 to 2022-12-31.

  • Study period: 2017-01-01 until 2022-12-31.

Treatment guidelines

One of the main intermediate outcome indicators to which clinical guidelines pay special attention is a good glycaemic control, since its absence is clearly related to micro and macrovascular complications. In clinical practice, suboptimal glycaemic control can be mainly attributed to two main reasons: the patients’ non-adherence to prescribed treatment; and the healthcare providers’ clinical or therapeutic guidelines’ non-adherence.

Treatment decisions are based on glycated hemoglobin measurements. In this context the redGDPS foundation provides DM2 treatment algorithm, a diagram that aims to help professionals to quickly and easily choose the most appropriate treatment for people with diabetes.

This browser does.

Process indicators

Another measures to which different studies pay special attention are process indicators. First, process indicators are essential for assessing the three dimensions of healthcare quality: effectiveness, safety, and patient-centeredness. They are measured relatively easily and often do not require risk-adjustment, making their interpretation straightforward. Poor performance on process indicators can be directly attributed to the actions of healthcare providers, providing a clear indication for improvement, such as better adherence to clinical guidelines. Moreover, process indicators allow for the identification of areas for quality improvement, which is crucial in the complex healthcare environment. These indicators cover a wide range of health factors and help focus resources and efforts to improve the health and well-being of the population. Therefore, the attention given to health process indicators is justified by their crucial role in evaluating and improving healthcare quality and population health.

Running code

The python libraries used are:

Code
import sys
import pm4py, subprocess
import pandas as pd
import numpy as np
import itertools
import matplotlib.pyplot as plt
import textdistance
import gensim.corpora as corpora
from tqdm import trange, tqdm
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import fcluster, linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering 
from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.cluster import KElbowVisualizer
from pm4py.objects.petri_net.obj import PetriNet, Marking
from pm4py.visualization.decisiontree import visualizer as tree_visualizer
from pm4py.algo.decision_mining import algorithm as decision_mining
from pm4py.visualization.petri_net import visualizer
from gensim.models import LdaModel
from datetime import  datetime, timedelta
from dateutil.relativedelta import relativedelta
import duckdb
import re
import logging

The R libraries used are:

Code
library(tidyverse)
library(lubridate)
library(jsonlite)
library(ggplot2)
library(bupaR)
library(processmapR)
library(dplyr)
library(DiagrammeR)
library(DiagrammeRsvg)
library(rsvg)
library(here)
library(survival)
library(lcmm)
library(coxme)
library(muhaz)
library(ggfortify)
library(bayestestR)
library(purrr)
library(duckdb)
library(logger)
library(finalfit)
library(flextable)
library(knitr)

Data preprocessing

First of all, it is important to prepare correctly the data for the analysis. The next function creates some sql views that are going to be useful later:

Code
def general_preprocess(con):
    '''
    Generating  views
    Args:
        con : db connector variable
    '''

    con.sql("CREATE OR REPLACE VIEW cmbd_incidents_view AS SELECT * FROM main.cmbd \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date >='2017-01-01')")
    con.sql("CREATE OR REPLACE VIEW ss_use_incidents_view AS SELECT * FROM main.ss_use \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date >='2017-01-01')")
    con.sql("CREATE OR REPLACE VIEW comorb_incidents_view AS SELECT * FROM main.comorb \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date >='2017-01-01')")
    con.sql("CREATE OR REPLACE VIEW param_incidents_view_ AS SELECT DISTINCT * FROM main.param \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date >='2017-01-01') \
                AND (param_value!=0 OR param_name!='bmi')")
    con.sql("CREATE OR REPLACE VIEW param_incidents_view AS \
            SELECT DISTINCT patient_id,param_name,MAX(param_value) AS param_value,param_date\
        FROM param_incidents_view_ GROUP BY patient_id,param_name,param_date")
    con.sql("CREATE OR REPLACE VIEW param_cat_incidents_view AS SELECT DISTINCT * FROM main.param_cat \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date >='2017-01-01')")
    con.sql("CREATE OR REPLACE VIEW patient_incidents_view AS SELECT *,\
            FLOOR(DATEDIFF('day',month_nac,DATE '2017-01-01') / 365.25) \
                AS 'age', \
            FLOOR(DATEDIFF('day',month_nac, dx_date) / 365.25) \
                AS 'age_dx', \
            DATE_ADD(month_nac, INTERVAL 75 YEAR) AS 'turn_to_75', \
            DATE_ADD(month_nac, INTERVAL 65 YEAR) AS 'turn_to_65' \
                FROM main.patient WHERE patient_id IN (SELECT patient_id \
                        FROM main.patient WHERE dx_date >='2017-01-01')")
    con.sql("CREATE OR REPLACE VIEW treat_incidents_view AS SELECT * FROM main.treat \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date >='2017-01-01')")
    con.sql("CREATE OR REPLACE VIEW exams_incidents_view AS SELECT * FROM main.exams \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date >='2017-01-01')")
                
    con.sql("CREATE OR REPLACE VIEW param_incidents_view_age_date AS SELECT param_incidents_view.patient_id, \
            param_name, param_value, param_date, turn_to_65, turn_to_75 FROM \
                param_incidents_view LEFT JOIN patient_incidents_view ON \
         param_incidents_view.patient_id = patient_incidents_view.patient_id")

    con.sql("CREATE OR REPLACE VIEW cmbd_incidents_postdx_view AS \
            SELECT * FROM cmbd_incidents_view \
                WHERE admission_date > (SELECT dx_date \
                                        FROM patient_incidents_view \
    WHERE patient_incidents_view.patient_id = cmbd_incidents_view.patient_id)")
                    
    con.sql("CREATE OR REPLACE VIEW cmbd_incidents_postdx_first_view AS \
            SELECT patient_id, admission_date, diagnosis1_code FROM (SELECT *,\
                      ROW_NUMBER() OVER(PARTITION BY patient_id \
                                        ORDER BY admission_date) AS rn \
                          FROM cmbd_incidents_postdx_view) t WHERE rn = 1;")
                
    con.sql("CREATE OR REPLACE VIEW param_incidents_predx_view AS SELECT * FROM (\
            SELECT * FROM (SELECT patient_id, param_name, param_value, param_date \
                            FROM param_incidents_view) sub \
                WHERE param_date <= (SELECT dx_date FROM patient_incidents_view \
                    WHERE patient_incidents_view.patient_id = sub.patient_id));")
        
    con.sql("CREATE OR REPLACE VIEW param_incidents_predx_last_view AS \
            SELECT patient_id, param_name, param_date, param_value \
                FROM (SELECT patient_id, param_name, param_date, param_value,\
                      ROW_NUMBER() OVER(PARTITION BY patient_id, param_name \
                                        ORDER BY param_date DESC) AS rn \
                          FROM param_incidents_predx_view \
                          WHERE param_value!=9999) t WHERE rn = 1;")
                    
    con.sql("CREATE OR REPLACE VIEW param_cat_incidents_predx_view AS SELECT * FROM (\
            SELECT * FROM (SELECT patient_id, param_cat_name, param_cat_value, param_cat_date \
                            FROM param_cat_incidents_view) sub \
                WHERE param_cat_date <= (SELECT dx_date FROM patient_incidents_view \
                    WHERE patient_incidents_view.patient_id = sub.patient_id));")
    con.sql("CREATE OR REPLACE VIEW param_cat_incidents_predx_last_view AS \
            SELECT patient_id, param_cat_name, param_cat_date, param_cat_value \
                FROM (SELECT patient_id, param_cat_name, param_cat_date, param_cat_value,\
                      ROW_NUMBER() OVER(PARTITION BY patient_id, param_cat_name \
                                        ORDER BY param_cat_date DESC) AS rn \
                          FROM param_cat_incidents_predx_view) t WHERE rn = 1;") 
                    
    con.sql("CREATE OR REPLACE VIEW param_prevalents_pre_view AS \
            SELECT DISTINCT patient_id, param_name, param_value, param_date FROM main.param \
             WHERE param_date < '2017-01-01' AND patient_id IN (\
                SELECT patient_id FROM main.patient WHERE dx_date < '2017-01-01')")
    con.sql("CREATE OR REPLACE VIEW param_prevalents_pre_last_view AS \
            SELECT patient_id, param_name, param_date, param_value \
                FROM (SELECT patient_id, param_name, param_date, param_value,\
                      ROW_NUMBER() OVER(PARTITION BY patient_id, param_name \
                                        ORDER BY param_date DESC) AS rn \
                          FROM param_prevalents_pre_view \
                              WHERE param_value!=9999) t WHERE rn = 1;")
                    
    con.sql("CREATE OR REPLACE VIEW param_cat_prevalents_pre_view AS \
            SELECT DISTINCT * FROM main.param_cat \
             WHERE param_cat_date < '2017-01-01' AND patient_id IN (\
                SELECT patient_id FROM main.patient WHERE dx_date < '2017-01-01')")
    con.sql("CREATE OR REPLACE VIEW param_cat_prevalents_pre_last_view AS \
            SELECT patient_id, param_cat_name, param_cat_date, param_cat_value \
                FROM (SELECT patient_id, param_cat_name, param_cat_date, param_cat_value,\
                      ROW_NUMBER() OVER(PARTITION BY patient_id, param_cat_name \
                                        ORDER BY param_cat_date DESC) AS rn \
                          FROM param_cat_prevalents_pre_view) t WHERE rn = 1;") 

Treatments’ event log creation

For using process mining an event log is needed. The sort of functions below take data model’s treatment and parameter tables to create an event log of glycated hemoglobin measures and treatment of patients. Glycated hemoglobin measurement events are divided into two different states, those that have a value smaller than “L” and the others. L’s value depends on patient’s diabetes duration, age and comorbilities and complications. Therefore with the help of expert clinicians we decided that L would take the following values:

  • L=7 if age<75 and no cardiovascular disease, heart failure, chronic kidney disease or fragility.
  • L=8 if age<65 and any cardiovascular disease, heart failure, chronic kidney disease or fragility.
  • L=8.5 if else.

As they are measures these events do not have any duration. In the case of treatments on the other hand a duration period exists. For treatments, events definition is based on drugs prescriptions if exists or dispensing dates and a fixed period time if else. Functions below make event logs with the previous considerations. Treatment analysis of this document is thought to do by the predominant clinical condition of patients according to DM2 treatment algorithm. In other words, patients are grouped by their predominant clinical condition and the analysis is realized independently to each group.

Code
logging.basicConfig(level=logging.INFO)
def separate_patients_by_condition(con):
    '''
    Filtering data by predominant clinical condition according to redGDPS
    Args:
        con : db connector variable
    Return:
        presc_data_p (float): percentage of not null data in prescription dates
        
    https://www.sciencedirect.com/science/article/pii/S1751991821000176?via%3Dihub#tbl0015
    https://www.sanidad.gob.es/estadEstudios/estadisticas/estadisticas/estMinisterio/SIAP/map_cie9mc_cie10_ciap2.htm
    
    https://cienciadedatosysalud.org/wp-content/uploads/Metodologia_Atlas_Prescripcion_enero_2023-1.pdf
    '''
    cvd_code_list = ['K74','K75','K76','K89','K90','K91']
    hf = 'K77'
    ckd = 'U99.01'
    ob = "T82"

    to_list = lambda l: "('"+"','".join(l)+"')"     
                    
                    
    presc_data_p = con.sql(f" SELECT (COUNT(prescription_date_ini)) / COUNT(*) \
                           AS p FROM treat_incidents_view").fetchall()[0][0]
    cie9_df = pd.read_csv('./CIE9.csv').rename(columns={'CIE9':'CIE',
                                                        'BDCAP':'CIAP'})
    cie9_df['comorb_codif'] = 'ICD9'
    cie10_df = pd.read_csv('./CIE10.csv').rename(columns={'CIE10':'CIE',
                                                          'BDCAP':'CIAP'})
    cie10_df['comorb_codif'] = 'ICD10'
    cie2ciap = pd.concat([cie9_df[['comorb_codif','CIE','CIAP']],
                          cie10_df[['comorb_codif','CIE','CIAP']]])
    
    con.sql("CREATE OR REPLACE VIEW comorb_incidents_ciap_view AS \
            SELECT * FROM comorb_incidents_view \
                LEFT JOIN cie2ciap \
                    ON comorb_incidents_view.comorb_codif = cie2ciap.comorb_codif \
                        AND comorb_incidents_view.comorb_code = cie2ciap.CIAP")
    con.sql("CREATE OR REPLACE VIEW comorb_incidents_active_view AS SELECT *, \
            CASE WHEN comorb_codif = 'CIAP2' THEN comorb_code ELSE CIAP END AS CIAP2 \
                FROM comorb_incidents_ciap_view WHERE comorb_date_end IS NULL;")

    # Which is the predominal clinical condition at baseline?
    f_list = set()
    ob_list = set(con.sql(f"SELECT DISTINCT patient_id \
                          FROM comorb_incidents_active_view WHERE CIAP2 = '{ob}'"
                          ).df()['patient_id'])        
    ckd_list = set(con.sql(f"SELECT DISTINCT patient_id \
                           FROM comorb_incidents_active_view WHERE CIAP2 = '{ckd}'"
                           ).df()['patient_id'])        
    hf_list = set(con.sql(f"SELECT DISTINCT patient_id \
                          FROM comorb_incidents_active_view WHERE CIAP2 = '{hf}'"
                          ).df()['patient_id'])        
    cvd_list = set(con.sql(f"SELECT DISTINCT patient_id \
                           FROM comorb_incidents_active_view \
                               WHERE CIAP2 IN {to_list(cvd_code_list)}"
                               ).df()['patient_id'])        

    # Which is the predominal clinical condition at dx_date?
    # ENFERMEDADES ISQUÉMICAS CARDIACAS (I20-I25)
    # ENFERMEDADES CEREBROVASCULARES (I60-I69)
    # ISQUEMIA CEREBRAL TRANSITORIA (G45)
    # ICTUS (G46)
    to_cvd_change_prelist = ['I2%i' %i for i in range(6)]+[
                             'I6%i' %i for i in range(10)]+[
                              'G45','G46']
    to_cvd_change_list = []
    for code in to_cvd_change_prelist:
     for c in cie10_df['CIE']:
      if code in c:
                to_cvd_change_list.append(c)
    # INSUFICIENCIA CARDÍACA (I50)
    # ENFERMEDAD CARDÍACA Y RENAL CRÓNICA HIPERTENSIVA (I13)
    to_hf_change_list = ['I09.81', 'I11.0', 'I13.0', 'I13.2', 'I50', 'I50.0',
                         'I50.1', 'I50.9']
    # ENFERMEDAD RENAL CRÓNICA (N18)
    # ENFERMEDAD RENAL CRÓNICA HIPERTENSIVA (I12)
    to_ckd_change_list = ['N18','N18.1','N18.2','N18.3','N18.30','N18.31',
                          'N18.32','N18.4','N18.5','N18.6','N18.9','I12',
                          'I12.0','I12.9','filtglom']
    to_f_change_list = ['f_date']
    to_ob_change_list = ['bmi>=30','E66.01','E66.09','E66.1','E66.2',
                         'E66.8','E66.9']
    # SOBREPESO Y OBESIDAD (E66)
    from_cvd_change_list = ['death']
    from_hf_change_list = from_cvd_change_list+to_cvd_change_list
    from_ckd_change_list = from_hf_change_list+to_hf_change_list
    from_f_change_list = from_ckd_change_list+to_ckd_change_list
    from_ob_change_list = from_f_change_list+to_f_change_list+['bmi<30']
    from_else_change_list = from_ob_change_list[:-1]+to_ob_change_list
    
    relevance_order = pd.DataFrame()
    relevance_order['code'] = from_ob_change_list+to_ob_change_list+['end']
    relevance_order['order'] = range(len(relevance_order))

    
       
    # https://ukkidney.org/health-professionals/information-resources/uk-eckd-guide/ckd-stages
    # Include obesity and kcd detection by parameters values                             
    param_filt = con.sql("SELECT * FROM param_incidents_view \
                         WHERE param_name = 'filtglom' AND param_value<60;"
                         ).df().sort_values(['patient_id','param_date'])
    param_filt['time_diff'] = param_filt.groupby(
        'patient_id')['param_date'].diff()
    cmbd_filt = param_filt[param_filt['time_diff'] < pd.Timedelta(days=90)
                           ].drop_duplicates(subset='patient_id',keep='first')[
                               ['patient_id','param_date','param_name']
                               ].rename(columns={
                                   'param_name':'code',
                                   'param_date':'admission_date'})
    con.sql("CREATE OR REPLACE VIEW cmbd_incidents_long_view AS SELECT *,\
            COALESCE(t.order, 404) AS relevance FROM\
            (SELECT patient_id, admission_date, diagnosis1_code AS code \
                FROM cmbd_incidents_view WHERE diagnosis1_code IS NOT NULL \
            UNION ALL SELECT patient_id, admission_date, diagnosis2_code AS code \
                FROM cmbd_incidents_view WHERE diagnosis2_code IS NOT NULL \
            UNION ALL SELECT patient_id, admission_date, diagnosis3_code AS code \
                FROM cmbd_incidents_view WHERE diagnosis3_code IS NOT NULL \
            UNION ALL SELECT patient_id, admission_date, code FROM cmbd_filt \
            UNION ALL SELECT patient_id, turn_to_75 AS admission_date, 'f_date' \
                AS code FROM patient_incidents_view WHERE turn_to_75<'2023-01-01'\
            UNION ALL SELECT patient_id, death_date AS admission_date, 'death' \
                AS code FROM patient_incidents_view WHERE death_date IS NOT NULL\
            UNION ALL SELECT patient_id, DATE '2023-01-01' AS admission_date, 'end' \
                AS code FROM patient_incidents_view\
            UNION ALL SELECT patient_id,param_date AS admission_date, \
        CASE WHEN param_value < 30 THEN 'bmi<30' ELSE 'bmi>=30' END AS code \
            FROM param_incidents_predx_last_view WHERE param_name = 'bmi' \
            UNION ALL SELECT patient_id, param_date AS admission_date, \
        CASE WHEN param_value < 30 THEN 'bmi<30' ELSE 'bmi>=30' END AS code \
            FROM param_incidents_view sub WHERE param_name = 'bmi' \
                AND  param_date > (SELECT dx_date FROM patient_incidents_view \
                                   WHERE patient_incidents_view.patient_id = sub.patient_id)) v \
                LEFT JOIN relevance_order t ON v.code=t.code;")
    
    f_list.update(set(con.sql("SELECT DISTINCT patient_id \
                              FROM patient_incidents_view WHERE turn_to_75<=dx_date"
                              ).df()['patient_id']))
    con.sql(f"CREATE OR REPLACE VIEW ob_events AS SELECT * \
            FROM cmbd_incidents_long_view sub \
                WHERE code IN {to_list(to_ob_change_list)} \
                    AND admission_date <= (SELECT dx_date \
                    FROM patient_incidents_view \
                        WHERE patient_incidents_view.patient_id = sub.patient_id);"
                            )
    ob_list.update(set(con.sql("SELECT DISTINCT patient_id \
                           FROM ob_events").df()['patient_id']))
    #Include patients with bmi>=30 in the first 21 days and no bmi data before dx 
    ob_list.update(set(con.sql(
        f"SELECT DISTINCT patient_id \
        FROM cmbd_incidents_long_view sub \
        WHERE code IN {to_list(to_ob_change_list)} \
            AND DATEDIFF('day',(SELECT dx_date \
            FROM patient_incidents_view \
                WHERE patient_incidents_view.patient_id = sub.patient_id),\
                         admission_date)<21 \
            AND patient_id NOT IN (SELECT DISTINCT patient_id \
                                   FROM ob_events);").df()['patient_id']))
                
    ob_list = ob_list - set(con.sql(f"SELECT DISTINCT patient_id \
                           FROM cmbd_incidents_long_view sub \
                               WHERE code='bmi<30' \
                                   AND admission_date <= (SELECT dx_date \
                        FROM patient_incidents_view \
                            WHERE patient_incidents_view.patient_id = sub.patient_id) \
                                   AND admission_date > (SELECT MAX(admission_date)\
                        FROM ob_events \
                            WHERE ob_events.patient_id  = sub.patient_id);"
                                   ).df()['patient_id'])
    ckd_list.update(set(con.sql(f"SELECT DISTINCT patient_id \
                           FROM cmbd_incidents_long_view sub \
                               WHERE code IN {to_list(to_ckd_change_list)} \
                                   AND admission_date <= (SELECT dx_date \
                        FROM patient_incidents_view \
                            WHERE patient_incidents_view.patient_id = sub.patient_id);"
                            ).df()['patient_id']))        
    hf_list.update(set(con.sql(f"SELECT DISTINCT patient_id \
                           FROM cmbd_incidents_long_view sub \
                               WHERE code IN {to_list(to_hf_change_list)} \
                                   AND admission_date <= (SELECT dx_date \
                        FROM patient_incidents_view \
                            WHERE patient_incidents_view.patient_id = sub.patient_id);"
                            ).df()['patient_id']))        
    cvd_list.update(set(con.sql(f"SELECT DISTINCT patient_id \
                           FROM cmbd_incidents_long_view sub \
                               WHERE code IN {to_list(to_cvd_change_list)} \
                                   AND admission_date <= (SELECT dx_date \
                        FROM patient_incidents_view \
                            WHERE patient_incidents_view.patient_id = sub.patient_id);"
                            ).df()['patient_id']))        
      
    
    cvd_list = list(cvd_list)
    hf_list = list(hf_list.difference(cvd_list))
    ckd_list = list(ckd_list.difference(cvd_list+hf_list))
    f_list = list(f_list.difference(cvd_list+hf_list+ckd_list))
    ob_list = list(ob_list.difference(cvd_list+hf_list+ckd_list+f_list))
    else_list = list(set(con.sql("SELECT DISTINCT patient_id \
                                 FROM patient_incidents_view"
                                 ).df()['patient_id']
                         ).difference(cvd_list+hf_list+ckd_list+f_list+ob_list))
               
    # When is a change on predominant clinical condition made?  
    con.sql(f"CREATE OR REPLACE VIEW cmbd_incidents_long_view_dx AS SELECT * \
            FROM cmbd_incidents_long_view sub \
                WHERE admission_date > (SELECT dx_date \
                    FROM patient_incidents_view \
                        WHERE patient_incidents_view.patient_id = sub.patient_id);")

    query_function =  lambda cond: "(WITH rankedevents_*** AS (\
        SELECT patient_id,code,admission_date,ROW_NUMBER() \
          OVER(PARTITION BY patient_id ORDER BY admission_date,relevance) AS event_rank \
              FROM cmbd_incidents_long_view_dx WHERE  patient_id IN {to_list(***_list)} \
                  AND code IN {to_list(from_***_change_list)} ) \
            SELECT patient_id, '***' AS type,code, admission_date AS date \
                FROM rankedevents_*** WHERE  event_rank=1 AND date<'2023-01-01') \
                                    UNION ALL (WITH rankedevents__*** AS (\
        SELECT patient_id,code,admission_date,ROW_NUMBER() \
          OVER(PARTITION BY patient_id ORDER BY admission_date) AS event_rank \
              FROM cmbd_incidents_long_view_dx WHERE  patient_id IN {to_list(***_list)} \
                  AND code IN {to_list(['end']+from_***_change_list)} ) \
            SELECT patient_id, '***' AS type,code, admission_date AS date \
                FROM rankedevents__*** WHERE  event_rank=1 AND code='end')\
                ".replace('***',cond)
    con.sql(eval('f"CREATE OR REPLACE TABLE patient_condition AS {}"'.format(' UNION ALL '.join(
        [query_function(cond) for cond in ['cvd','hf','ckd','f','ob','else']]))))    
    return presc_data_p
                

def evlog_creation_by_prescriptions(con,
                                 cond,
                                 code2drug_info_path='./diabetes_drugs.csv'):
    '''Preprocessing and event log obtention with prescriptions
  
    Args:
      con : db connector variable
      cond (str): predominal clinical condition's code
      code2drug_info_path (str): drugs' and their codes' info's table's path
  
    ADNI: ANTIDIABETICOS NO INSULINICOS
  
    The treatment of type 2 diabetes mellitus with ADNI includes a wide range of 
    drugs which, depending on their drugs which, according to their mechanisms of 
    action, can be grouped as follows: 
     Increased endogenous insulin sensitivity:
        o Biguanides: metformin (MET).
        o Thiazolidinediones: pioglitazone (PIO).
     Increased endogenous insulin secretion/release:
        o Sulfonylureas (SU).
        o Meglitinides: repaglinide (REP)
     Reduction of digestive glucose absorption:
        o Alpha-glucosidase inhibitors.
        o Vegetable fiber and derivatives.
     Incretin effect enhancers.
        o Inhibitors of DPP-4 (iDPP-4).
        o GLP-1 analogues (aGLP-1).
     Inhibitors of renal glucose reuptake
        o SGLT-2 inhibitors (iSGLP-2)

    '''
    fin_date = con.sql(f"SELECT patient_id,date FROM patient_condition \
                       WHERE type = '{cond}'").df()
    dx_date = con.sql(f"SELECT patient_id, dx_date FROM  patient_incidents_view \
                      WHERE patient_id IN (SELECT patient_id \
                         FROM patient_condition \
                             WHERE patient_condition.type = '{cond}')").df()
    dx_date = dict(zip(dx_date.patient_id,dx_date.dx_date))
    fin_date = dict(zip(fin_date.patient_id,fin_date.date))
    code2drug = pd.read_csv(code2drug_info_path,index_col=0).to_dict()
    code2drug_f = lambda atc: code2drug.get(atc,{'class':'NONE'}
                                           ).get('class','NONE2'
                                                 ).replace('+','&')
    treat_df = con.sql(f"SELECT * FROM treat_incidents_view \
                       WHERE patient_id IN (SELECT patient_id \
                                            FROM patient_condition \
                                                WHERE patient_condition.type = '{cond}') \
                           AND substring(atc_code,1,3) = 'A10' \
                           AND prescription_date_ini IS NOT NULL").df()
    treat_df['Event'] = treat_df['atc_code'].apply(code2drug_f)
    treat_df = treat_df[~treat_df['Event'].isin(['NONE','NONE2'])
                        ].drop_duplicates(subset=['patient_id',
                                                  'prescription_date_ini',
                                                  'prescription_date_end',
                                                  'Event'])
    treat_df['actins'] = range(len(treat_df))
    treat_df_start = treat_df[['patient_id','prescription_date_ini',
                              'Event','actins']].rename(columns = 
                                       {'prescription_date_ini':'date'})
    treat_df_end = treat_df[['patient_id','prescription_date_end',
                              'Event','actins']].rename(columns = 
                                       {'prescription_date_end':'date'})
    
    treat_df_start['cycle'] = 'start'
    treat_df_end['cycle'] = 'end'
    treat_df_end['date'] = treat_df_end['date'].fillna(
        datetime.strptime('2022-12-31', "%Y-%m-%d")).apply(
    lambda x: min([x,datetime.strptime('2023-01-01', "%Y-%m-%d")])-timedelta(days=1))
            
    act_n = max(treat_df['actins'])
    if cond in ['cvd','hf','ckd','f']:
        param_df = con.sql(f"SELECT patient_id,param_date AS date, \
                           CASE WHEN param_value < {8} THEN 'HBA<L' \
                               WHEN param_value >= {8} AND param_value < {8.5} \
                                   AND param_date < turn_to_65 THEN 'HBA>L' \
                               WHEN param_value >= {8} AND param_value < {8.5} \
                                   AND param_date >= turn_to_65 THEN 'HBA<L' \
                               ELSE 'HBA>L' END AS Event,\
                           ROW_NUMBER() OVER (ORDER BY param_date) + {act_n} AS actins\
                               FROM param_incidents_view_age_date \
                                   WHERE patient_id IN (SELECT patient_id \
                                    FROM patient_condition \
                                        WHERE patient_condition.type = '{cond}') \
                                    AND param_name = 'hba1c' \
                                    AND param_value IS NOT NULL \
                                    AND param_date IS NOT NULL").df()
    else:
        param_df = con.sql(f"SELECT patient_id,param_date AS date, \
                           CASE WHEN param_value < {7} THEN 'HBA<L' \
                               WHEN param_value >= {7} AND param_value < {8.5} \
                                   AND param_date < turn_to_75 THEN 'HBA>L' \
                               WHEN param_value >= {7} AND param_value < {8.5} \
                                   AND param_date >= turn_to_75 THEN 'HBA<L' \
                               ELSE 'HBA>L' END AS Event,\
                           ROW_NUMBER() OVER (ORDER BY param_date) + {act_n} AS actins\
                               FROM param_incidents_view_age_date \
                                   WHERE patient_id IN (SELECT patient_id \
                                    FROM patient_condition \
                                        WHERE patient_condition.type = '{cond}') \
                                    AND param_name = 'hba1c' \
                                    AND param_value IS NOT NULL \
                                    AND param_date IS NOT NULL").df()        

    param_df_start = param_df.copy()
    param_df_start['cycle'] = 'start'
    param_df_end = param_df.copy()
    param_df_end['cycle'] = 'end'

    df = pd.concat([param_df_start,treat_df_start,param_df_end,treat_df_end]
                   ).drop_duplicates()
    
    df = df.sort_values(by = ['patient_id','date','Event','cycle'],
                        ascending = [True,True,True,False])
    df.index = range(len(df))
    patient_list = list(set(df['patient_id']))
    df['nid'] = [patient_list.index(df['patient_id'][n]) for n in range(len(df))]

    #####################################################################
    
    event_log = dict()
    event_log['patient_id'] = []
    event_log['date'] = []
    event_log['nid'] = []
    event_log['Event'] = []
    event_log['cycle'] = []
    days_after_m = 5
    id_list = list(set(df['patient_id']))
    events = list(set([drug for e in set(df['Event']) for drug in e.split('&')]))
    hba0, hba1 = events.index('HBA<L'), events.index('HBA>L')
    row = len(events)
    no_hba = [i for i in range(row) if i not in [hba0,hba1]]
    for id in tqdm(id_list):
        df_id = df[df['patient_id']==id]
        nid = id_list.index(id)
        actins = set(df_id['actins'])
        date_min = min([min(set(df_id['date'])),dx_date[id]])
        date_max =  max(set(df_id['date']))
        if str(fin_date[id])!='nan':
            date_max = max([date_max,fin_date[id]])
        else:
            date_max = max([date_max,datetime.strptime('2023-01-01', "%Y-%m-%d")])
        dd = [date_min + timedelta(days=x) for x in range(
          (date_max-date_min).days + 1)]
        col = len(dd)
        ev_status = np.zeros((row,col))
        for act in actins:
            ini = list(df_id['date'][np.logical_and(df_id['actins']==act,
                                                  df_id['cycle']=='start')])[0]
            ev =  list(df_id['Event'][df_id['actins']==act])[0]            
            fin = list(df_id['date'][np.logical_and(df_id['actins']==act,
                                                  df_id['cycle']=='end')])[0]
            ev_cols = [events.index(ev_) for ev_ in ev.split('&')]
            ev_status = change_matrix_values(ev_status,ev_cols,
                                             list(range(dd.index(ini),
                                                        dd.index(fin)+1)),1)
                            
        measures = list(np.concatenate((ev_status[hba0,:].nonzero()[0],
                                        ev_status[hba1,:].nonzero()[0])))
        ev_status = ev_status.astype(int)              
        #delete treatments in measure events        
        ev_status = change_matrix_values(ev_status,no_hba,measures,0)
        #correction to eliminate events after hemoglobin measurement that are
        #the continuation of the previous treatment and not the new treatment.
        #Example: If a patient has 'A' treatment and because of hemoglobin's
        #         measure they change their treatment to 'B', originally their
        #         trace could appear as A>measure>A>B, but the true trace
        #         should be A>measure>B. Therefore, in a fixed period time 
        #         after each measurement 'days_after_m', if we detect that 
        #         mistake we delete it. 

        for j in measures:
            if ev_status.shape[1]<=j+1 or days_after_m<3:
                continue
            first_col = ev_status[:,j+1]
            dif_col = np.where(np.any(ev_status[:, j+2:j+days_after_m]!=
                                      first_col[:, np.newaxis],
                                      axis=0))[0]
          
            if  (dif_col.size > 0) and (not
                    np.array_equal(ev_status[:,j-1],
                                    ev_status[:,j+dif_col[0]+2])) and (
                    np.array_equal(ev_status[:,j-1],
                                    ev_status[:,j+1])):
                ev_status = change_matrix_values(ev_status,no_hba,
                                                  range(j+1,j+dif_col[0]+2),0)

          
        col = dd.index(dx_date[id])
        col_0 = dd.index(dx_date[id])
        col_max = dd.index(fin_date[id]) if str(fin_date[id])!='nan' else len(dd)
        min_change_days = 7
        while col<col_max-1:
            if not np.array_equal(ev_status[:,col],ev_status[:,col+1]):
                ev = col2treat(ev_status,events,col_0)
                if  ('HBA' not in ev and col-col_0<min_change_days):
                    col+=1
                    col_0 = col
                    continue
                event_log['patient_id'].extend([id,id])
                event_log['date'].extend([dd[col_0],dd[col]])
                event_log['Event'].extend([ev,ev])
                event_log['nid'].extend([nid,nid])
                event_log['cycle'].extend(['start','end'])
                
                col+=1
                col_0 = col
            else:
                col+=1
        ev = col2treat(ev_status,events,col_0)
        if 'HBA' in ev or col-col_0>=min_change_days:
            event_log['patient_id'].extend([id,id])
            event_log['date'].extend([dd[col_0],dd[col]])
            event_log['Event'].extend([ev,ev])
            event_log['nid'].extend([nid,nid])
            event_log['cycle'].extend(['start','end'])
    
            
    evlog = pd.DataFrame.from_dict(event_log)   
    evlog = evlog.sort_values(['patient_id','date'])
    evlog.rename(columns = {'patient_id':'ID'}, inplace = True)
    evlog.index = range(len(evlog))
    evlog['cycle'] = ['start','end']*int(0.5*len(evlog))
    evlog['actins'] = [n//2 for n in range(len(evlog))]
    con.sql(f"CREATE OR REPLACE TABLE evlog_raw_{cond} AS SELECT * FROM evlog")
 
def evlog_creation_by_dispensations(con, 
                              cond,
                              code2drug_info_path='./diabetes_drugs.csv',
                              nac_path='./Nomenclator_de_Facturacion.csv'):
    '''Preprocessing and event log obtention with dispensations

    Args:
      con : db connector variable
      cond (str): predominal clinical condition's code
      code2drug_info_path (str): drugs' and their codes' info's table's path
      nac_path (str): drugs' and their containers' info's table's path
    
    ADNI: ANTIDIABETICOS NO INSULINICOS
    
    The treatment of type 2 diabetes mellitus with ADNI includes a wide range of 
    drugs which, depending on their drugs which, according to their mechanisms of 
    action, can be grouped as follows: 
     Increased endogenous insulin sensitivity:
        o Biguanides: metformin (MET).
        o Thiazolidinediones: pioglitazone (PIO).
     Increased endogenous insulin secretion/release:
        o Sulfonylureas (SU).
        o Meglitinides: repaglinide (REP)
     Reduction of digestive glucose absorption:
        o Alpha-glucosidase inhibitors.
        o Vegetable fiber and derivatives.
     Incretin effect enhancers.
        o Inhibitors of DPP-4 (iDPP-4).
        o GLP-1 analogues (aGLP-1).
     Inhibitors of renal glucose reuptake
        o SGLT-2 inhibitors (iSGLP-2)
    
    '''

    treat_min_days = 2
    days_error = 60-treat_min_days
    days_after_m = 7
    fin_date = con.sql(f"SELECT patient_id,date FROM patient_condition \
                       WHERE type = '{cond}'").df()
    dx_date = con.sql(f"SELECT patient_id, dx_date FROM  patient_incidents_view \
                      WHERE patient_id IN (SELECT patient_id \
                         FROM patient_condition \
                             WHERE patient_condition.type = '{cond}')").df()
    dx_date = dict(zip(dx_date.patient_id,dx_date.dx_date))
    fin_date = dict(zip(fin_date.patient_id,fin_date.date))
    code2drug = pd.read_csv(code2drug_info_path,index_col=0).to_dict()
    code2drug_f = lambda atc: code2drug.get(atc,{'class':'NONE'}
                                           ).get('class','NONE2'
                                                 ).replace('+','&')
    treat_df = con.sql(f"SELECT * FROM treat_incidents_view \
                       WHERE patient_id IN (SELECT patient_id \
                                            FROM patient_condition \
                                                WHERE patient_condition.type = '{cond}') \
                           AND substring(atc_code,1,3) = 'A10' \
                           AND dispensing_date IS NOT NULL").df()
    treat_df['Event'] = treat_df['atc_code'].apply(code2drug_f)
    treat_df = treat_df[~treat_df['Event'].isin(['NONE','NONE2'])]
    nac_dict = nac2comprimidos(nac_path,set(treat_df['nac_code']))
    treat_df['nac_code'] = treat_df['nac_code'].apply(
        lambda x: nac_dict.get(x,days_error))
    treat_df['containers_number'] = treat_df['containers_number'].fillna(1)
    treat_df['nac_code'] = treat_df['nac_code']*treat_df['containers_number']
    treat_df = treat_df[['patient_id','dispensing_date','Event','nac_code']
                        ].rename(columns = {'dispensing_date':'date'}
                                 ).sort_values(by = ['patient_id','date' ],
                      ascending = [True, True])
                     
    if cond in ['cvd','hf','ckd','f']:
        df = con.sql(f"SELECT *, 'start' AS cycle,\
                     ROW_NUMBER() OVER (ORDER BY patient_id, date) AS actins \
                         FROM (SELECT patient_id,param_date AS date, \
                           CASE WHEN param_value < {8} THEN 'HBA<L' \
                                WHEN param_value >= {8} AND param_value < {8.5} \
                                   AND param_date < turn_to_65 THEN 'HBA>L' \
                                WHEN param_value >= {8} AND param_value < {8.5} \
                                   AND param_date >= turn_to_65 THEN 'HBA<L' \
                                ELSE 'HBA>L' END AS Event, 0 AS nac_code\
                               FROM param_incidents_view_age_date \
                                   WHERE patient_id IN (SELECT patient_id \
                                    FROM patient_condition \
                                        WHERE patient_condition.type = '{cond}') \
                                       AND param_name = 'hba1c' \
                                    AND param_value IS NOT NULL \
                                    AND param_date IS NOT NULL\
                                UNION ALL SELECT * FROM treat_df) \
                    UNION ALL SELECT patient_id, date, Event, nac_code, 'end' AS cycle,\
                     ROW_NUMBER() OVER (ORDER BY patient_id, param_date) AS actins \
                         FROM (SELECT patient_id,param_date AS date, param_date,\
                           CASE WHEN param_value < {8} THEN 'HBA<L' \
                                WHEN param_value >= {8} AND param_value < {8.5} \
                                   AND param_date < turn_to_65 THEN 'HBA>L' \
                                WHEN param_value >= {8} AND param_value < {8.5} \
                                   AND param_date >= turn_to_65 THEN 'HBA<L' \
                                ELSE 'HBA>L' END AS Event, 0 AS nac_code\
                               FROM param_incidents_view_age_date \
                                   WHERE patient_id IN (SELECT patient_id \
                                    FROM patient_condition \
                                        WHERE patient_condition.type = '{cond}') \
                                       AND param_name = 'hba1c' \
                                    AND param_value IS NOT NULL \
                                    AND param_date IS NOT NULL\
                                UNION ALL SELECT patient_id,\
                                CAST(date AS TIMESTAMP)+INTERVAL {treat_min_days} \
                                    DAYS AS date, date AS param_date, \
                                        Event, nac_code FROM treat_df)").df()
    else:
        df = con.sql(f"SELECT *, 'start' AS cycle,\
                     ROW_NUMBER() OVER (ORDER BY patient_id, date) AS actins \
                         FROM (SELECT patient_id,param_date AS date, \
                           CASE WHEN param_value < {7} THEN 'HBA<L' \
                                WHEN param_value >= {7} AND param_value < {8.5} \
                                   AND param_date < turn_to_75 THEN 'HBA>L' \
                                WHEN param_value >= {7} AND param_value < {8.5} \
                                   AND param_date >= turn_to_75 THEN 'HBA<L' \
                                ELSE 'HBA>L' END AS Event, 0 AS nac_code\
                               FROM param_incidents_view_age_date \
                                   WHERE patient_id IN (SELECT patient_id \
                                    FROM patient_condition \
                                        WHERE patient_condition.type = '{cond}') \
                                       AND param_name = 'hba1c' \
                                    AND param_value IS NOT NULL \
                                UNION ALL SELECT * FROM treat_df) \
                    UNION ALL SELECT patient_id, date, Event, nac_code, 'end' AS cycle,\
                     ROW_NUMBER() OVER (ORDER BY patient_id, param_date) AS actins \
                         FROM (SELECT patient_id,param_date AS date, param_date,\
                           CASE WHEN param_value < {7} THEN 'HBA<L' \
                                WHEN param_value >= {7} AND param_value < {8.5} \
                                   AND param_date < turn_to_75 THEN 'HBA>L' \
                                WHEN param_value >= {7} AND param_value < {8.5} \
                                   AND param_date >= turn_to_75 THEN 'HBA<L' \
                                ELSE 'HBA>L' END AS Event, 0 AS nac_code\
                               FROM param_incidents_view_age_date \
                                   WHERE patient_id IN (SELECT patient_id \
                                    FROM patient_condition \
                                        WHERE patient_condition.type = '{cond}') \
                                       AND param_name = 'hba1c' \
                                    AND param_value IS NOT NULL \
                                UNION ALL SELECT patient_id,\
                                CAST(date AS TIMESTAMP)+INTERVAL {treat_min_days} \
                                    DAYS AS date, date AS param_date, \
                                        Event, nac_code FROM treat_df)").df()

    patient_list = list(set(df['patient_id']))
    df['nid'] = [patient_list.index(df['patient_id'][n]) for n in range(len(df))] 
    df = df.sort_values(by = ['patient_id', 'actins','cycle'],
                        ascending = [True, True, False])
    df.index = range(len(df))
    
    #####################################################################

    event_log = dict()
    event_log['patient_id'] = []
    event_log['date'] = []
    event_log['nid'] = []
    event_log['Event'] = []
    event_log['cycle'] = []
    
    id_list = list(set(df['patient_id']))
    events = list(set([drug for e in set(df['Event']) for drug in e.split('&')]))
    hba0, hba1 = events.index('HBA<L'), events.index('HBA>L')
    row = len(events)
    no_hba = [i for i in range(row) if i not in [hba0,hba1]]
    for id in tqdm(id_list):
        df_id = df[df['patient_id']==id]
        nid = id_list.index(id)
        actins = set(df_id['actins'])
        date_min = min([min(set(df_id['date'])),dx_date[id]])
        date_max =  max(set(df_id['date']))
        if str(fin_date[id])!='nan':
            date_max = max([date_max,fin_date[id]])
        else:
            date_max = max([date_max,datetime.strptime('2023-01-01', "%Y-%m-%d")])
        dd = [date_min + timedelta(days=x) for x in range(
          (date_max-date_min).days + 1)]
        col = len(dd)
        ev_status = np.zeros((row,col))
        max_nac = int(max(df_id['nac_code'])*1.2)
        #event matrix where columns are days and rows treatments
        #0:no treatment, 1:treatment, 2:posible treatment,
        #3:after measuring period
        for act in actins:
            ini = list(df_id['date'][np.logical_and(df_id['actins']==act,
                                                  df_id['cycle']=='start')])[0]
            ev =  list(df_id['Event'][df_id['actins']==act])[0]            
            fin = ini+timedelta(days=treat_min_days) if 'HBA' not in ev else ini
            ev_cols = [events.index(ev_) for ev_ in ev.split('&')]
            ev_status = change_matrix_values(ev_status,ev_cols,
                                             list(range(dd.index(ini),
                                                        dd.index(fin)+1)),1)
            if 'HBA' not in ev:
                possible_day_duration = int(list(
                    df_id['nac_code'][np.logical_and(df_id['actins']==act,
                    df_id['cycle']=='start')])[0]*1.2)
                until_day = min([dd.index(ini)+possible_day_duration,col])
                ev_status = change_matrix_values(ev_status,ev_cols,
                                                 list(range(dd.index(fin),
                                                            until_day)),2)
                            
                                        
                
                
        measures = list(np.concatenate((ev_status[hba0,:].nonzero()[0],
                                        ev_status[hba1,:].nonzero()[0])))
        #event compaction.
        #Example: If a patient has 'A' treatment and they are dispensing 'A'
        #         drug multiple times, their trace could appear as A>A>A>A, 
        #         but the true trace should appear as 'A' treatment lasting
        #         its corresponding time. Therefore, if the difference between
        #         the last day and the first day of two consecutive same drugs's
        #         dispensation is less than the number of tablets of the 
        #         container two events are jointed in one with the first day of
        #         the first event as initial date and last day of the last 
        #         event as the end date.
        ev_status = ev_status.astype(int)
        for i in no_hba:
            seq = str(list(ev_status[i,:]))
            for delta_days in range(1,max_nac):
                pattern = ', '+str([1]+[2]*delta_days+[1])[1:-1]
                new_pattern = ', '+str([1]+[1]*delta_days+[1])[1:-1]
                seq = seq.replace(pattern,new_pattern)
            ev_status[i,:] = eval(seq) 
              
        ev_status = change_matrix_values(ev_status,no_hba,measures,3)
        for i,j in list(itertools.combinations(no_hba,2)):
            seq_i = str(list(ev_status[i,:]))
            seq_j = str(list(ev_status[j,:]))
            for delta_days in range(1,max_nac):
                pattern_i = ', '+str([1]+[1]*delta_days+[3])[1:-1]
                if re.search(pattern_i,seq_i)==None:
                    continue
                pattern_j = ', '+str([1]+[2]*delta_days+[3])[1:-1]
                new_pattern_j = ', '+str([1]+[1]*delta_days+[3])[1:-1]
                seq_j = seq_j.replace(pattern_j,new_pattern_j)
            for delta_days in range(1,max_nac):
                pattern_j = ', '+str([1]+[1]*delta_days+[3])[1:-1]
                if re.search(pattern_j,seq_j)==None:
                    continue
                pattern_i = ', '+str([1]+[2]*delta_days+[3])[1:-1]
                new_pattern_i = ', '+str([1]+[1]*delta_days+[3])[1:-1]
                seq_i = seq_i.replace(pattern_i,new_pattern_i)
            ev_status[i,:] = eval(seq_i)
            ev_status[j,:] = eval(seq_j)

        measures_w_dp = [[i+d_p for d_p in range(days_after_m+1)
                          if i+d_p<col] for i in measures]    
        measures_w_dp = list(itertools.chain(*measures_w_dp))
        ev_status = change_matrix_values(ev_status,no_hba,measures_w_dp,3)

        for i in no_hba:
            seq = str(list(ev_status[i,:]))
            seq = seq.replace('2','0').replace('3','0')
            ev_status[i,:] = eval(seq)               
                
        col = dd.index(dx_date[id])
        col_0 = dd.index(dx_date[id])
        col_max = dd.index(fin_date[id]) if str(fin_date[id])!='nan' else len(dd)
        min_change_days = 42
        while col<col_max-1:
            if not np.array_equal(ev_status[:,col],ev_status[:,col+1]):
                ev = col2treat(ev_status,events,col_0)
                if  (ev=='_' and col-col_0<min_change_days):
                    col+=1
                    col_0 = col
                    continue
                event_log['patient_id'].extend([id,id])
                event_log['date'].extend([dd[col_0],dd[col]])
                event_log['Event'].extend([ev,ev])
                event_log['nid'].extend([nid,nid])
                event_log['cycle'].extend(['start','end'])
                
                col+=1
                col_0 = col
            else:
                col+=1
        ev = col2treat(ev_status,events,col_0)
        if ev!='_' or col-col_0>=min_change_days:
            event_log['patient_id'].extend([id,id])
            event_log['date'].extend([dd[col_0],dd[col]])
            event_log['Event'].extend([ev,ev])
            event_log['nid'].extend([nid,nid])
            event_log['cycle'].extend(['start','end'])
    
            
    evlog = pd.DataFrame.from_dict(event_log)   
    evlog = evlog.sort_values(['patient_id','date'])
    evlog.rename(columns = {'patient_id':'ID'}, inplace = True)
    evlog.index = range(len(evlog))
    evlog['cycle'] = ['start','end']*int(0.5*len(evlog))
    evlog['actins'] = [n//2 for n in range(len(evlog))]
    con.sql(f"CREATE OR REPLACE TABLE evlog_raw_{cond} AS SELECT * FROM evlog")
  
def change_matrix_values(matrix,list_rows,list_cols, new_value=0):
    '''change selected rows' and columns' matrix`s values
    Args:
      matrix (array): matrix wanted to modify
      list_rows (list): matrix's rows wanted to modify
      list_cols (list): matrix's columns wanted to modify
      new_value (int): new wanted value
      
    Output:
      matrix (array): modified matrix
    '''
    pos = list(itertools.product(list_rows,list_cols))
    for i,j in pos:
        matrix[i,j] = new_value
    return matrix

def col2treat(matrix,rows,col):
    '''translate treatments from a binary vector
  
    Args:
      matrix (array): events calendary in binary information
      rows (str): events list
      col (int): column of matrix wanted to translate
    
    Output:
      treatment (str):
    '''
    treatment = '+'.join(sorted([rows[ev
                            ] for ev in range(len(rows)) 
                                 if matrix[ev,col]!=0]))
    if treatment!='':
        return treatment
    else:
        return '_'
    
def nac2comprimidos(nac_path,code_set):
    '''Creating dictionary of nac drugs container's number of tablets
  
    Args:
      nac_path (str): nac table's path
      code_set (set): nac code list
    Outputs:
        nac_dict (dic): nac codes as keys and their number of tablets as values
    '''

    nac = pd.read_csv(nac_path)
    nac['Código Nacional'] = nac['Código Nacional'].astype(str)
    nac_dict = {}

    for n in range(len(nac)):
        if nac['Código Nacional'][n] in code_set:
            text = nac['Nombre del producto farmacéutico'][n].lower()
            if re.search(r'\d+(?=\s*comprim)', text)!=None:
                nac_dict[nac['Código Nacional'][n]
                         ] = int(re.search(r'\d+(?=\s*comprim)', text).group())
            elif re.search(r'\d+(?=\s*comprim)', 
                           re.sub(r'\([^()]*\)', '', text))!=None:
                nac_dict[nac['Código Nacional'][n]
                         ] = int(re.search(r'\d+(?=\s*comprim)', 
                                re.sub(r'\([^()]*\)', '', text)).group())          
            elif re.search(r'con pelicula (\d+)', text)!=None:
                nac_dict[nac['Código Nacional'][n]
                         ] = int(re.search(r'con pelicula (\d+)',
                                    text).group().split()[-1])
    return nac_dict

Process indicators’ event log creation

With the objective of studying process indicators a function to make a process indicators’ event log is coded above. There, in addition to laboratory measures and realized exams, three artificial events have been included to aid the analysis.’INI’ event denotes each patient’s diabetes diagnosis date, ‘yFIN’ events indicate the end of each annual interval and ‘FIN’ event is the date of the end of the last completed annual interval.

Code
def evlog_process_ind(con):
    '''Generation of process indicators' eventlog

    Args:
      con : db connector variable
    '''
    
    con.sql("CREATE OR REPLACE VIEW pre_process_ind AS SELECT *,'start' AS cycle,\
                ROW_NUMBER() OVER (ORDER BY patient_id,date) AS actins \
            FROM (SELECT DISTINCT patient_id,'hba1c' AS event, \
            CAST(param_date AS TIMESTAMP) + INTERVAL 60 SECOND AS date \
            FROM param_incidents_view WHERE param_name = 'hba1c' \
            UNION ALL SELECT DISTINCT patient_id,'col' AS event, \
                    CAST(param_date AS TIMESTAMP) + INTERVAL 120 SECOND AS date \
            FROM param_incidents_view WHERE param_name IN  ('col','hdl','ldl') \
            UNION ALL SELECT DISTINCT patient_id,'blood_p' AS event, \
                    CAST(param_date AS TIMESTAMP) + INTERVAL 180 SECOND AS date \
            FROM param_incidents_view WHERE param_name IN  ('sbp','dbp') \
            UNION ALL SELECT DISTINCT patient_id,'indalbcr' AS event, \
                    CAST(param_date AS TIMESTAMP) + INTERVAL 240 SECOND AS date \
            FROM param_incidents_view WHERE param_name='indalbcr' \
            UNION ALL SELECT DISTINCT patient_id,'filtglom' AS event, \
                    CAST(param_date AS TIMESTAMP) + INTERVAL 300 SECOND AS date \
            FROM param_incidents_view WHERE param_name='filtglom' \
            UNION ALL SELECT DISTINCT patient_id,'bmi' AS event, \
                    CAST(param_date AS TIMESTAMP) + INTERVAL 360 SECOND AS date \
            FROM param_incidents_view WHERE param_name IN ('bmi','weight') \
            UNION ALL SELECT DISTINCT patient_id,'ocular_exam' AS event, \
                    CAST(test_date AS TIMESTAMP) + INTERVAL 420 SECOND AS date \
            FROM exams_incidents_view WHERE test_name='ocular_exam_PC' \
            UNION ALL SELECT DISTINCT patient_id,'foot_exam' AS event, \
                    CAST(test_date AS TIMESTAMP) + INTERVAL 480 SECOND AS date \
            FROM exams_incidents_view WHERE test_name='foot_exam' \
            UNION ALL SELECT patient_id,'INI' AS event, \
                    CAST(dx_date AS TIMESTAMP) AS date \
            FROM patient_incidents_view \
            UNION ALL SELECT patient_id, 'yFIN' AS event,\
                    CAST(dx_date AS TIMESTAMP) + INTERVAL 365 DAYS AS date \
            FROM patient_incidents_view \
            UNION ALL SELECT patient_id, 'yFIN' AS event,\
                    CAST(dx_date AS TIMESTAMP) + INTERVAL 730 DAYS AS date \
            FROM patient_incidents_view \
            UNION ALL SELECT patient_id, 'yFIN' AS event,\
                    CAST(dx_date AS TIMESTAMP) + INTERVAL 1095 DAYS AS date \
            FROM patient_incidents_view \
            UNION ALL SELECT patient_id, 'yFIN' AS event,\
                    CAST(dx_date AS TIMESTAMP) + INTERVAL 1461 DAYS AS date \
            FROM patient_incidents_view \
            UNION ALL SELECT patient_id, 'yFIN' AS event,\
                    CAST(dx_date AS TIMESTAMP) + INTERVAL 1826 DAYS AS date \
            FROM patient_incidents_view \
            UNION ALL SELECT patient_id, 'FIN' AS event,\
                    CAST(COALESCE(deregistration_date,'2023-01-01') AS TIMESTAMP) AS date \
            FROM patient_incidents_view)") 
    
    
    
    
    con.sql("CREATE OR REPLACE TABLE process_ind AS SELECT a.* FROM pre_process_ind a \
            JOIN (SELECT patient_id, MIN(date) as ini_date,MAX(date) as fin_date \
            FROM pre_process_ind d WHERE event IN ('INI', 'yFIN') AND \
                date<(SELECT c.date FROM pre_process_ind c \
                      WHERE c.event = 'FIN' AND c.patient_id = d.patient_id) \
            GROUP BY patient_id) b ON a.patient_id = b.patient_id \
            WHERE (a.date BETWEEN b.ini_date AND b.fin_date) OR a.event = 'FIN'")
    con.sql("CREATE OR REPLACE TABLE process_ind1 AS SELECT * FROM \
            (WITH RankedActions AS (SELECT patient_id, date, event,\
                    ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY date) AS rn \
                        FROM process_ind WHERE event = 'yFIN') \
             SELECT DISTINCT A.* FROM process_ind A \
                 JOIN RankedActions R ON A.patient_id = R.patient_id WHERE \
                     R.rn = 1 AND (A.date <= R.date OR A.event = 'FIN'))")  
    con.sql("CREATE OR REPLACE TABLE process_ind2 AS SELECT * FROM \
            (WITH RankedActions AS (SELECT patient_id, date, event,\
                    ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY date) AS rn \
                        FROM process_ind WHERE event = 'yFIN') \
             SELECT DISTINCT A.* FROM process_ind A \
                 JOIN RankedActions R ON A.patient_id = R.patient_id WHERE \
                     R.rn = 2 AND (A.date <= R.date OR A.event = 'FIN'))")        
    con.sql("CREATE OR REPLACE TABLE process_ind3 AS SELECT * FROM \
            (WITH RankedActions AS (SELECT patient_id, date, event,\
                    ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY date) AS rn \
                        FROM process_ind WHERE event = 'yFIN') \
             SELECT DISTINCT A.* FROM process_ind A \
                 JOIN RankedActions R ON A.patient_id = R.patient_id WHERE \
                     R.rn = 3 AND (A.date <= R.date OR A.event = 'FIN'))")        
    con.sql("CREATE OR REPLACE TABLE process_ind4 AS SELECT * FROM \
            (WITH RankedActions AS (SELECT patient_id, date, event,\
                    ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY date) AS rn \
                        FROM process_ind WHERE event = 'yFIN') \
             SELECT DISTINCT A.* FROM process_ind A \
                 JOIN RankedActions R ON A.patient_id = R.patient_id WHERE \
                     R.rn = 4 AND (A.date <= R.date OR A.event = 'FIN'))")        
    con.sql("CREATE OR REPLACE TABLE process_ind5 AS SELECT * FROM \
            (WITH RankedActions AS (SELECT patient_id, date, event,\
                    ROW_NUMBER() OVER (PARTITION BY patient_id ORDER BY date) AS rn \
                        FROM process_ind WHERE event = 'yFIN') \
             SELECT DISTINCT A.* FROM process_ind A \
                 JOIN RankedActions R ON A.patient_id = R.patient_id WHERE \
                     R.rn = 5 AND (A.date <= R.date OR A.event = 'FIN'))") 
        

Distances

One of the most important aim of process mining is to show and explain the processes. However, the great variety of traces does not allow us to draw any clear conclusions and it is often necessary to simplify our data. Another option that we can do before simplifying, to avoid the excessive losing of information and give another perspective to the analysis, is to cluster the traces and to analyze them by cluster. To do that we have to measure somehow the differences between the traces; the distance between them. There are some distances that we can use to this task: edit distances, vector term similarity, LDA based distances, embedding based distances… Some of them are shown below as functions to calculate the distance matrix of traces:

Code
#######   EDIT DISTANCE   ####### 
def calculate_dm_ED(traces,measure_f):
    '''Calculate distance matrix with some edit distance.
    
    Args:
      traces (list): patients' traces
      measure_f: some edit distance function
      
    Returns:
      dm: distance matrix
    '''
    id2word = corpora.Dictionary(traces)

    traces_e = [[id2word.token2id[t[n]] for n in range(len(t))] for t in traces]
    traces_e_str = list(set([str(traces_e[i]
                                 ) for i in range(len(traces_e))]))    
    len_t_r = len(traces_e_str)
    len_t = len(traces_e)
    dm = np.zeros((len_t,len_t), dtype = np.float32)
    same = measure_f(traces_e[0],traces_e[0])
    d_dic = {str(t):dict() for t in traces_e_str}
    for i in range(len_t_r):
        d_dic[traces_e_str[i]][traces_e_str[i]] = same
        for j in range(i+1,len_t_r):
            d_ij = measure_f(eval(traces_e_str[i]),
                             eval(traces_e_str[j]))
            d_dic[traces_e_str[i]][traces_e_str[j]] = d_ij
            d_dic[traces_e_str[j]][traces_e_str[i]] = d_ij

    for i in range(len_t):
        dm[i][i] = same
        for j in range(i+1,len_t):
            t_i = str(traces_e[i])
            t_j = str(traces_e[j])            
            d_ij = d_dic[t_i][t_j]
            dm[i][j] = d_ij
            dm[j][i] = d_ij
    if same == 1:
        dm = 1 - dm  
    
    return dm

#######   TERM VECTOR SIMILARITY   #######      
def calculate_dm_TV(traces):
    '''Calculate distance matrix with term vector similarity.
    
    Args:
      traces (list): list of traces
      
    Returns:
      dm (array): distance matrix
      vectorizer: TfidfVectorizer
      X: traces vectorized with TfidVectorizer
        
    '''
    corpus = [' '.join(t) for t in traces]
    vectorizer = TfidfVectorizer(tokenizer=str.split)
    X = vectorizer.fit_transform(corpus)
    print('calculatin dm ...')
    dm = np.asarray(np.matmul(X.todense(),X.todense().T))
    dm = 1 - dm.round(decimals=4)           
    return dm, vectorizer, X

#######   LDA BASED DISTANCE   #######      
def calculate_dm_LDA(traces,T=10):
    '''Calculate distance matrix with LDA model.
    
    Args:
      traces (list): list of traces
      T (int): number of topics of LDA model
    Returns:
      dm (array): distance matrix
      lda_model: LdaModel
      id2word (dict): tokenized events as keys and events by values
        
    '''

    # Create Dictionary
    id2word = corpora.Dictionary(traces)
    
    # Term Document Frequency
    corpus = [id2word.doc2bow(text) for text in traces]
    # Make LDA model
    lda_model = LdaModel(corpus=corpus,
                         id2word=id2word,
                         num_topics=T,
                         alpha = 1,
                         eta = 'auto',
                         random_state = 123)
    get_c_topic = np.array(
        lda_model.get_document_topics(corpus,minimum_probability = -0.1))
    sigma = np.asarray([[get_c_topic[i][j][1] 
              for j in range(T)] for i in range(len(corpus))])

    sigma2 = np.asarray(np.matmul(sigma,sigma.T))
    len_t = len(traces)
    dm = np.zeros((len_t,len_t), dtype = np.float32)
    
    same = sigma2[0][0]/np.sqrt(sigma2[0][0]*sigma2[0][0])
    for i in trange(len_t):
        dm[i][i] = same
        for j in range(i+1,len_t):
            d_ij = sigma2[i][j]/np.sqrt(sigma2[i][i]*sigma2[j][j])
            dm[i][j] = d_ij
            dm[j][i] = d_ij
    
            

    dm = 1-dm
    return dm, lda_model, id2word

Clustering

Once the distance matrix is obtained, we can proceed with the clustering. Each clustering method has its characteristics and its peculiarities. For example, we have to consider we have a distance matrix and not a data frame in which we apply directly the method. A hierarchical clustering algorithm seems to be a good choice in our case because in addition to the above it allows to choose the optimal number of clusters.

The next code box contains two different functions to choose the optimal number of clusters:

Code
def dendogram(dm,output_png='../../outputs/dendogram.png'):
  '''Plot and save dendogram.
    
    Args:
      dm (array): distance matrix
      output_png (str): saved dendogram's path

  '''

  dm_condensed = squareform(dm)
  
  matrix = linkage(
      dm_condensed,
      method='complete'
      )
  sys.setrecursionlimit(10000000)
  dn = dendrogram(matrix,truncate_mode='lastp',p=80)
  sys.setrecursionlimit(1000)
  plt.title('Dendrogram')
  plt.ylabel('Distance')
  plt.xlabel('Patients traces')
  plt.savefig(output_png)
  plt.clf()

def kelbow(dm,
           cond,
           elbow_metric='distortion',
           locate_elbow=False,
           output_path='../../outputs/'):
  
  '''Plots to choose optimal clusters.
    
    Args:
      dm (array): distance matrix
      cond (str): predominal clinical condition's code 
      elbow_metric (str): name of the method
      locate_elbow (boolean): True if want to return optimal number of clusters
    Returns:
      k_opt (int)(optional): optimal number of clusters according to method
  '''

  model = AgglomerativeClustering(metric = "precomputed",
                                  linkage = 'complete')
  # k is range of number of clusters.
  visualizer_ = KElbowVisualizer(model,
                                k=(2,25),
                                timings=False,
                                xlabel = 'cluster numbers',
                                metric=elbow_metric,
                                locate_elbow=locate_elbow)
  # Fit data to visualizer
  output_file = output_path+elbow_metric+'_'+cond+'.png'
  visualizer_.fit(dm)
  # Finalize and render figure
  visualizer_.show(outpath=output_file,clear_figure=True)
  k_opt=None
  if locate_elbow:
    k_opt = visualizer_.elbow_value_ 
    return k_opt

The function ‘clust’ clusterizes traces in prefixed number of clusters:

Code
def clust(clust_n,dist_matrix,df_,id2trace,patients):
  '''clusterize distance matrix.
    
    Args:
      clust_n (int): number of clusters obtained
      dist_matrix (array): distance matrix
      df_ (dataframe): event log
      id2trace (dict): patient ids as keys and their traces as values
      patients (list): patients' ids in same order as in dm
    Returns:
      df_ (dataframe): dataframe with patients and their clusters
  '''
  traces = list(id2trace[id] for id in sorted(id2trace.keys()))     


  model = AgglomerativeClustering(n_clusters=clust_n,
                                  metric = "precomputed",
                                  linkage = 'complete')
  model.fit(dist_matrix)
  labels = model.labels_

  cluster_list ={id: labels[traces.index(id2trace[id])
                            ] for id in patients}

  df_['cluster'] = [cluster_list[df_['ID'][i]] for i in range(len(df_))]
  return df_

Descriptive analysis of treatments’ eventlog

The implementation below is made to show the most frequent traces in each cluster:

Code
def make_data_dict(log,top_k,col_id):
  '''Obtain most frequent traces and their statistics
    
    Args:
      log (dataframe): event log 
      top_k (int): number of traces want to show
      col_id (str): patients id column's name in df_
    Returns:
      data_dict (dict): traces as keys and ther statistics as values   
  '''

  len_id = len(set(log[col_id]))
  log_freq = pm4py.stats.get_variants(log)
  freq_list = [(t,log_freq[t],len(t)) for t in set(log_freq.keys())]
  trace = [list(t[0]) for t in sorted(freq_list, key=lambda x: 
                                              (len_id-x[1],x[2]))[:top_k]]
  cases = [t[1] for t in sorted(freq_list, key=lambda x: 
                                              (len_id-x[1],x[2]))[:top_k]]
  top_k = min(top_k,len(cases))
  percentage = [100*cases[c]/len_id for c in range(top_k)]
  cum_percentage = [sum(percentage[:p+1]) for p in range(top_k)]
  data_dict = {"Trace": trace,
               "Percentage": percentage,
               "Cases": cases,
               "Cumulative Percentage": cum_percentage}
  return data_dict
  
  
def update_color_dict(color_dict,data_dict):
  '''update of the color dict to include new events
    
    Args:
      color_dict (dict): events as keys and colors as values 
      data_dict (dict):  traces as keys and ther statistics as values
    Returns:
      color_dict (dict): events as keys and colors as values    
  '''
  cmap = plt.cm.get_cmap('tab20')
  for event in set(itertools.chain.from_iterable(data_dict['Trace'])):
      if event not in color_dict and len(color_dict)==20:
         cmap = plt.cm.get_cmap('tab20b')
      if event not in color_dict:    
        try:
          color_dict.update({event:cmap(len(color_dict))})
        except:
          color_dict.update({event:cmap(2*(len(color_dict)-20))})
  return color_dict 


def trace_plotter(data_dict,
                  color_dict,
                  acronym,
                  output_file,
                  font_size=10,
                  percentage_box_width=0.8,
                  size=(15,9)):
  '''configuration of the trace_explorer plot
    
    Args:
      color_dict (dict): events as keys and colors as values 
      data_dict (dict):  traces as keys and their statistics as values
      acronym (dict): events as keys and their acronyms as values
      output_file (str): figure's path
      font_size (int): font size
      percentage_box_width (float): event boxes' width
      size (tuple): figure's size
  '''

  fig, ax = plt.subplots(figsize=size)
  percentage_position = max(len(t) for t in data_dict["Trace"]
                            ) + percentage_box_width*3 +0.5
  for row, (trace, percentage,cases,cum_percentage
            ) in enumerate(zip(data_dict["Trace"],
                               data_dict["Percentage"],
                               data_dict["Cases"],
                               data_dict["Cumulative Percentage"]),
                               start=1):
    for col, acr in enumerate(trace, start=1):
        ax.add_patch(plt.Rectangle((col - 0.5, row - 0.45), 1, 0.9,
                                    facecolor=color_dict[acr],
                                    edgecolor='white'))
        ax.text(col, 
                row, 
                acr, 
                ha='center', 
                va='center', 
                color='white',
                fontsize = font_size, 
                fontweight='bold')
        ax.add_patch(plt.Rectangle((
          percentage_position -percentage_box_width*2.5,
          row - 0.45), percentage_box_width, 0.9,
          facecolor='grey', edgecolor='white'))
        ax.text(percentage_position-percentage_box_width*2,
                row,
                f'{percentage:.2f}%',
                ha='center',
                va='center',
                color='white',
                fontsize = font_size+2)
        ax.add_patch(plt.Rectangle((
          percentage_position - percentage_box_width*1.5,
          row - 0.45), percentage_box_width, 0.9,
          facecolor='grey', edgecolor='white'))
        ax.text(percentage_position-percentage_box_width,
                row,
                f'{cases}',
                ha='center',
                va='center',
                color='white',
                fontsize = font_size+4)
        ax.add_patch(plt.Rectangle((percentage_position-percentage_box_width*0.5, 
                                    row - 0.45), percentage_box_width, 0.9,
                                    facecolor='grey', edgecolor='white'))
        ax.text(percentage_position,
                row,
                f'{cum_percentage:.2f}%', 
                ha='center',
                va='center',
                color='white',
                fontsize = font_size+2)
      
  ax.set_xlim(0.5, percentage_position+0.5)
  ax.set_xticks(range(1, int(percentage_position-1)))
  ax.set_ylabel('Traces',fontsize = font_size+3)
  ax.set_ylim(len(data_dict["Trace"]) + 0.45, 0.55) # y-axis is reversed
  ax.set_yticks([])
  ax.set_xlabel('Activities',fontsize = font_size+3)
  
  handles = [plt.Rectangle((0, 0), 0, 0, facecolor=color_dict[acr],
                            edgecolor='black', label=acronym[acr])
              for acr in acronym if acr in set(
                itertools.chain.from_iterable(data_dict['Trace']))]
  ax.legend(handles=handles, 
            bbox_to_anchor=[1.02, 1.02],
            loc='upper left',
            fontsize = font_size+6)
  for dir in ['left', 'right', 'top']:
      ax.spines[dir].set_visible(False)
  plt.tight_layout()
  plt.savefig(output_file)
  plt.close()


def trace_explorer(con,
                   cond,
                   top_k=5,
                   id_col='ID',
                   ev_col='Event',
                   date_col='date',
                   clust_col='cluster',
                   color_dict={}):
  '''Plot each clusters most frequent traces
    
    Args:
      con : db connector variable
      cond (str): predominal clinical condition's code
      top_k (int):  traces as keys and their statistics as values
      id_col (str): patients id column's name in evlog_file
      ev_col (str): events column's name in evlog_file
      date_col (str): events dates column's name in evlog_file
      clust_col (str): cluster column's name in evlog_file
      color_dict (dict): events as keys and colors as values  
  '''
  log_ = pm4py.format_dataframe(con.sql(f"SELECT * FROM eventlog_{cond}_clust_filtered").df(),
                                case_id=id_col,
                                activity_key=ev_col,
                                timestamp_key=date_col)                  
  log_ = log_.sort_values([id_col,date_col])
  log_ = log_[log_['cycle']=='start'] 
  for clust in set(log_[clust_col]):
      log = log_[log_[clust_col]==clust]
      len_id = len(set(log[id_col]))
      acronym = {t:t for t in sorted(set(log[ev_col]))}
      data_dict = make_data_dict(log,top_k,id_col)
      color_dict = update_color_dict(color_dict, data_dict)
      trace_plotter(data_dict,color_dict,acronym,
                    '../../outputs/t_cluster_%s_%i.png' % (cond,clust))
  return color_dict

To get the process maps of each cluster next R functions can be used:

Code
load_log <- function(con,
                     query,case_id="ID",
                     activity_id="Event", 
                     lifecycle_id="cycle",
                     activity_instance_id="actins",
                     timestamp="date"){
  
  eventlog <- dbGetQuery(con,query)  
  eventlog = eventlog[order(eventlog$ID),]
  #To transform date to a format we can work with
  eventlog$date = as.POSIXct(eventlog$date, tz = "", format="%Y-%m-%d" ,
                                   tryFormats = c("%Y/%m/%d",
                                                  "%Y/%m/%d"),
                                   optional = FALSE) 
  
  
  
  evLog = eventlog %>%
    mutate(resource = NA) %>%
    mutate(cycle = fct_recode(cycle,
                              "start" = "start",
                              "complete" = "end")) %>%
    eventlog(
      case_id = case_id,
      activity_id = activity_id, 
      lifecycle_id = lifecycle_id, 
      activity_instance_id = activity_instance_id, 
      timestamp = timestamp, 
      resource_id = 'resource'
    )
  return(evLog)
}
make_process_map <- function(log,t_freq,output_file){
  log %>%
    filter_activity_frequency(percentage = 1) %>% # show only most frequent
    filter_trace_frequency(percentage = t_freq) %>%
    process_map(type_nodes = performance(mean,units='days'),
                sec_nodes = frequency('relative_case'),
                type_edges = frequency('absolute'),
                sec_edges = frequency('relative_case'),
                render = T) %>% 
    export_svg %>% 
    charToRaw %>%
    rsvg_png (output_file,width=2000)
}

process_map_by_cluster <- function(evLog,t_freq,cond_code){
  for (clust in unique(evLog$cluster)) {
    log <- evLog %>% 
      filter(cluster == clust)
    make_process_map(log,t_freq,gsub("src/analysis-scripts/",
                                     "",here("outputs",sprintf(
                                       "pm_cluster_%s_%d.png", 
                                       cond_code, clust) )))

  }
}

Conformance checking

Conformance checking is a technique used to check process compliance by comparing event logs for a discovered process with the existing reference model (target model) of the same process. Basing on the DM2 treatment algorithm previous shown, with a software called Carassius , we created the next Petri Nets that are going to be useful as treatment guidelines in reference to glycated hemoglobin measures.

Figure 1: DM2 Treatment Algorithm’s interpretation to patients with some CV disease in Petri Net format

Figure 2: DM2 Treatment Algorithm’s interpretation to patients with heart failure in Petri Net format

Figure 3: DM2 Treatment Algorithm’s interpretation to patients with some chronic kidney disease in Petri Net format

Figure 4: DM2 Treatment Algorithm’s interpretation to patients with frailty in Petri Net format

Figure 5: DM2 Treatment Algorithm’s interpretation to patients with obesity in Petri Net format

Figure 6: DM2 Treatment Algorithm’s interpretation to patients without predominant clinical condition in Petri Net format

Fitness is the metric that measures how much a trace is distanced from a given process model, or from the guidelines in this case. There are different methods to calculate this metric but in the code below is used the aligned fitness. Since in this metric the initial marking and the final marking have to be fixed we included the events ‘INI’ and ‘FIN’ in the Petri Net and in each trace. Adding this artificial events allows us to compare all traces fitness in the same conditions.

Code
def id2treat_fitness(log ,
               net, 
               initial_marking, 
               final_marking, 
               clust_col='cluster',
               date_col='date',
               ev_col='Event'):
  '''Obtain traces fitness
    
    Args:
      log (dataframe): event log 
      net: petri net
      initial_marking: initial place in the petri net
      final_marking: final place in the petri net
      clust_col (str): cluster column's name in log
      date_col (str): events dates column's name in log
      ev_col (str): events column's name in log
    Returns:
      df (dataframe): traces, their clusters and their fitnesses  
  '''
  id2trace = {id:list(log['Event'][log['case:concept:name']==id]
                      ) for id in set(log['case:concept:name'])}
  id2ids = {id:[id2 for id2 in set(id2trace.keys()) if id2trace[id]==id2trace[id2]
                ] for id in set(log['case:concept:name'])}
  for id in set(id2ids.keys()):
      try:
          for id2 in id2ids[id]:
              if id2!=id:
                  del id2ids[id2]
      except:
          continue
  
  id2fitness = dict()
  for name in set(id2ids.keys()):
      log2 = log.drop(log.index[log['case:concept:name'] !=name])
      new = log2.copy().iloc[[0, -1]]
      date_list = list(new[date_col])
      index = list(new['@@index'])
      actins = list(new['actins'])
      clust = new[clust_col].values[0]
      new[date_col] =  new['time:timestamp'] = [date_list[0]-timedelta(days=1),
                                                date_list[1]+timedelta(days=1)]
      new[ev_col] = new['concept:name'] = ['INI','FIN']
      new['@@index'] = [index[0]-1,index[1]+1]
      new['actins'] = [actins[0]-1,actins[1]+1]
    
      log2 = pd.concat([log2,new]).sort_values('time:timestamp')
      aligned_fitness = pm4py.conformance_diagnostics_alignments(
                        log2, net, initial_marking, final_marking)[0]['fitness']
      for id in id2ids[name]:
          id2fitness[id] = {'ID':id,
                            'aligned_fitness':aligned_fitness,
                            clust_col: clust}
  
  df = pd.DataFrame(id2fitness).T
  df.index = range(len(df))
  return df

The function below takes a treatments’ event log and returns each patient’s period’s fitness.

Code
def id2treat_fitness_by_interval(con,
                           cond,
                           pn_file,
                           ini_place='place100',
                           fin_place='place111',
                           date_n_col='date',
                           fixed_period_time=90):
    '''Obtain traces fitness by intervals of fixed_period_time days
    
    Args:
      con: db connector variable
      cond (str): predominal clinical condition's code
      pn_file (str): petri net's path
      ini_place (str): initial place in the petri net
      fin_place (str): final place in the petri net
      date_n_col (str): events dates column's name in log
      fixed_period_time (int): number of days in each interval
  '''
    net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
    initial_marking = Marking()
    initial_marking[list(net.places)[[str(p) for p in net.places].index(
        ini_place)]] = 1
    final_marking = Marking()
    final_marking[list(net.places)[[str(p) for p in net.places].index(
        fin_place)]] = 1
    event_log = pm4py.format_dataframe(con.sql(f"SELECT * FROM eventlog_{cond}_clust \
    WHERE cycle = 'start'").df(),
                                case_id='ID',
                                activity_key='Event',
                                timestamp_key=date_n_col) 
    event_log = event_log[event_log['cycle']=='start'] 
    baseline = datetime.strptime('2017-01-01', "%Y-%m-%d")
    endline = datetime.strptime('2023-01-01', "%Y-%m-%d")
    dd = [baseline + timedelta(days=x) for x in range((
              endline-baseline).days + 1)]
    dd_INI = dd[0]-timedelta(days=1)
    dd_FIN = dd[-1]+timedelta(days=1)
    event_log['time:timestamp'] = event_log['time:timestamp'].dt.tz_localize(None)
    event_log[date_n_col] = event_log[date_n_col].dt.tz_localize(None)
    event_log['date_'] = event_log[date_n_col].apply(
        lambda x: dd.index(x))
    event_log[date_n_col] = event_log[date_n_col].apply(
        lambda x: dd.index(x))
    event_log[date_n_col] = event_log.groupby('ID')[date_n_col].apply(
        lambda x: x-x.min())
    
    hosp = con.sql("SELECT * FROM cmbd_incidents_postdx_first_view").df()
    hosp['admission_date'] = hosp['admission_date'].dt.tz_localize(None).apply(
        lambda x: dd.index(x))
    hosp = dict(zip(hosp.patient_id,hosp.admission_date))
    
    period2fitness = pd.DataFrame()
    id_list_df = []
    p_start = []
    p_end = []
    fitness = []
    date_0 = []
    status = []
    id_list = list(set(event_log['ID']))
    for id in tqdm(id_list):
        event_log_id = event_log[event_log['ID']==id]
        hosp_d = hosp.get(id,100000)-min(event_log_id['date_'])
        date_max = min(max(event_log_id[date_n_col]),hosp_d)
        date_min = min(event_log_id['time:timestamp'])
        n_periods = date_max//fixed_period_time
        n_periods_r = date_max/fixed_period_time
        if date_max<=fixed_period_time:
            continue
        for n in range(1,n_periods+1):
            event_log_id_n = event_log_id[
                event_log_id[date_n_col]<n*fixed_period_time]
            
            new = event_log_id.copy().iloc[[0, -1]]
            actins = list(new['actins'])
            index = list(new['@@index'])
            new['time:timestamp'] = [dd_INI,dd_FIN]
            new['concept:name'] = ['INI','FIN']
            new['@@index'] = [index[0]-1,index[1]+1]
            new['actins'] = [actins[0]-1,actins[1]+1]
            event_log_id_n = pd.concat([event_log_id_n,new]
                                       ).sort_values('time:timestamp')
            aligned_fitness = pm4py.conformance_diagnostics_alignments(
                             event_log_id_n, net, 
                             initial_marking, final_marking)[0]['fitness']
            id_list_df.append(id)
            p_start.append((n-1)*fixed_period_time)
            p_end_value = n*fixed_period_time
            if n==n_periods:
                p_end_value = date_max-fixed_period_time
            p_end.append(p_end_value)
            fitness.append(aligned_fitness)
            date_0.append(date_min)
            if (n==n_periods or n==n_periods_r-1) and hosp_d==date_max :
                status.append(1)
                break
            else:
                status.append(0)
    
    period2fitness['ID'] = id_list_df
    period2fitness['t_0'] = p_start
    period2fitness['t_1'] = p_end
    period2fitness['fitness'] = fitness
    period2fitness['ini_date'] = date_0
    period2fitness['status'] = status
    con.sql(f"CREATE OR REPLACE TABLE period2fitness_{cond} AS SELECT *,\
            MAX(status) OVER (PARTITION BY ID) AS status2  FROM period2fitness \
            WHERE t_0!=t_1") 

In the next function is shown a boxplot function to show clusters’ fitness distribution.

Code
def treatments_clusters_boxplot(con,
                cond,
                pn_file,
                pn_png_file,
                ini_place='place100',
                fin_place='place111',
                date_col='date',
                ev_col='Event',
                clust_col='cluster',
                output_png='../../outputs/fitness_by_cluster.png'):
  '''Barplot of the fitness of each cluster
    
    Args:
      con: db connector variable
      cond (str): predominant clinical condition
      pn_file: petri net's path
      ini_marking: initial place in the petri net
      fin_marking: final place in the petri net
      date_col (str): events dates column's name in log
      clust_col (str): cluster column's name in log
      ev_col (str): events column's name in log
      output_png (str): created figure's path
  '''
  log = pm4py.format_dataframe(con.sql(f"SELECT * FROM eventlog_{cond}_clust_filtered \
  WHERE cycle = 'start'").df(),
                                case_id='ID',
                                activity_key=ev_col,
                                timestamp_key=date_col)                  

  log = log.sort_values(date_col)
  log.index = range(len(log))
  net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
  initial_marking = Marking()
  initial_marking[list(net.places)[[str(p) for p in net.places].index(
    ini_place)]] = 1
  final_marking = Marking()
  final_marking[list(net.places)[[str(p) for p in net.places].index(
    fin_place)]] = 1
  pm4py.save_vis_petri_net(net, initial_marking, final_marking,pn_png_file)
  df = id2treat_fitness(log,net,initial_marking,final_marking,
                  clust_col,date_col,ev_col)
  df['aligned_fitness'] = pd.to_numeric(df['aligned_fitness'])
  df['cluster'] = pd.to_numeric(df['cluster'])
  df['ID'] = df['ID'].astype("string")
  con.sql(f"CREATE OR REPLACE TABLE eventlog_{cond}_byclust AS \
  SELECT * FROM df")
  data = [list(df['aligned_fitness'][df[clust_col]==i])
          for i in sorted(set(df[clust_col]))]
   
  fig = plt.figure(figsize =(10, 7))
  # Creating axes instance
  ax = fig.add_axes([0, 0, 1, 1])
  # Creating plot
  bp = ax.boxplot(data,labels=[i for i in sorted(set(df[clust_col]))])
  plt.xlabel("Clusters")
  plt.ylabel("Aligned Fitness")
  # show plot
  plt.savefig(output_png,bbox_inches='tight')
  plt.close(fig)
  

Process indicators can also be represented in a Petri Net. Therefore we created the next Petri Nets to analyze whether the traces obey the guidelines the first five years:

Figure 7: Petri Net of process indicators of the first year

Figure 8: Petri Net of process indicators of the first two years

Figure 9: Petri Net of process indicators of the first three years

Figure 10: Petri Net of process indicators of the first four years

Figure 11: Petri Net of process indicators of the first five years

The function below takes a process indicators’ event log and returns each patient’s fitness by year. The method to calculate the fitness in this case is the token base replay method.

Code
def id2process_fitness(con,
                       query,
                       pn_file,
                       pn_png_file,
                       ini_place='place37',
                       fin_place='place148',
                       id_col = 'patient_id',
                       event_col = 'event',
                       date_col='date',):
    '''Calculate the fitness of a given process indicators' event log
    
    Args:
      con: db connector variable
      cond (str): predominal clinical condition's code
      query (str): query to select event log
      pn_file (str): petri net's path
      ini_place (str): initial place in the petri net
      fin_place (str): final place in the petri net
      id_col (str): ids' column's name in event log
      event_col (str): events' column's name in event log
      date_col (str): events' dates' column's name in event log
    Returns:
      df (dataframe): dataframe of ids' fitness
  '''
    net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
    initial_marking = Marking()
    initial_marking[list(net.places)[[str(p) for p in net.places].index(
        ini_place)]] = 1
    final_marking = Marking()
    final_marking[list(net.places)[[str(p) for p in net.places].index(
        fin_place)]] = 1
    pm4py.save_vis_petri_net(net, initial_marking, final_marking,pn_png_file)
    event_log = pm4py.format_dataframe(con.sql(query).df(),
                                case_id=id_col,
                                activity_key=event_col,
                                timestamp_key=date_col) 
    date_list = []
    fit_list = []
    id_list = list(set(event_log[id_col]))
    for id in tqdm(id_list):
        log = event_log.drop(event_log.index[event_log['case:concept:name'] !=id]
                              ).sort_values('time:timestamp')
        fit_obj = pm4py.conformance.conformance_diagnostics_token_based_replay(
            log, net, initial_marking, final_marking)
        date_list.append(list(log['time:timestamp'])[-2])
        fit_list.append(fit_obj[0]['trace_fitness'])
        
    log_0 = log[log['concept:name'].isin(['INI','FIN','yFIN'])]
    alfa = pm4py.conformance.conformance_diagnostics_token_based_replay(
        log_0, net, initial_marking, final_marking)[0]['trace_fitness']
    
    df =  pd.DataFrame()
    df['ID'] = id_list
    df['fitness'] = [(x-alfa)/(1-alfa) for x in fit_list]
    df['date'] = date_list
    return df

Time dependent Cox model

The link between guideline adherence, in terms of performed process measures, and clinical outcomes is a highly demanded issue in diabetes care. A Cox model is a statistical technique for exploring the relationship between the survival of a patient and several explanatory variables. One of the strengths of the Cox model is its ability to encompass covariates that change over time, such as treatment adherence. A time dependent Cox model can be made using each patient trace’s fitness at different time interval.

Joint Latent Class Model

Joint models are used to analyse simultaneously two related phenomena, the evolution of a variable and the occurence of an event. Joint latent class models (JLCM) consist of a linear mixed model and a proportional hazard model linked by the latent classes. The population is split in several groups, the latent classes, and each class is caracterized by a specific evolution of the dependent variable and an associated risk of event. Using fitness as time dependent treatment adherence measure we can made a joint latent class model.

Results

Cohort description

Code
###INCIDENTS
patient_inc <- dbGetQuery(con,
                       "SELECT *,
                       CASE WHEN sex = 0 THEN 'Male' 
                       ELSE 'Female' END AS sexo FROM patient_incidents_view") 
patient_inc <- patient_inc %>%
  left_join( country2cont %>% select(country, CC), by = "country") %>%
  mutate(education=factor(education, levels=c(0,1,2,3),
      labels=c("Without studies","Primary school",
               "High school","University")),
      copayment=factor(copayment,levels=c(0,1),
                       labels=c("less than 18000","more or equal than 18000")))
patient_inc$CC[patient_inc$country == 724] <- 'Spain'
dependent="sexo"
explanatory=c("age","age_dx","CC","copayment","education", "avg_income")  

patient_inc%>%summary_factorlist(dependent,
                              explanatory,
                              total_col = TRUE,
                              cont="mean",
                              cont_cut=7,
                              na_include = TRUE,
                              na_to_prop =  FALSE, 
                              include_col_totals_percent    =FALSE,
                              add_col_totals = TRUE)-> patient_inc_tab

kable(patient_inc_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))
Table 1: Demographic and socioeconomic characteristics of incidents patients
label levels Female Male Total
Total N 2728 2516 5244
age Mean (SD) 78.6 (13.4) 78.6 (13.2) 78.6 (13.3)
age_dx Mean (SD) 80.4 (13.5) 80.4 (13.3) 80.4 (13.4)
CC AS 1364 (50.0) 1258 (50.0) 2622 (50.0)
EU 682 (25.0) 629 (25.0) 1311 (25.0)
OC 682 (25.0) 629 (25.0) 1311 (25.0)
copayment less than 18000 892 (49.4) 904 (52.4) 1796 (50.9)
more or equal than 18000 912 (50.6) 820 (47.6) 1732 (49.1)
(Missing) 924 792 1716
education Without studies 0 (NaN) 0 (NaN) 0 (NaN)
Primary school 0 (NaN) 0 (NaN) 0 (NaN)
High school 0 (NaN) 0 (NaN) 0 (NaN)
University 0 (NaN) 0 (NaN) 0 (NaN)
(Missing) 2728 2516 5244
avg_income (Missing) 2728 2516 5244
Code
param_inc <- dbGetQuery(con,"SELECT * FROM param_incidents_predx_last_view")
param_list <- sort(unique(param_inc$param_name))
param_cat_inc <- dbGetQuery(con,
  "SELECT *, \
  CASE WHEN param_cat_name='physical_activity' AND param_cat_value=0 \
  THEN 'Incapacity' \
  WHEN param_cat_name='physical_activity' AND param_cat_value=1 \
  THEN 'Inactive' \
  WHEN param_cat_name='physical_activity' AND param_cat_value=2 \
  THEN 'Partially Active' \
  WHEN param_cat_name='physical_activity' AND param_cat_value=3 \
  THEN 'Active' \
  WHEN param_cat_name='smoking_status' AND param_cat_value=0 \
  THEN 'Non Smoker' \
  WHEN param_cat_name='smoking_status' AND param_cat_value=1 \
  THEN 'Ex-smoker < 1 year' \
  WHEN param_cat_name='smoking_status' AND param_cat_value=2 \
  THEN 'Ex-smoker >= 1 year' \
  WHEN param_cat_name='smoking_status' AND param_cat_value=3 \
  THEN 'Smoker' \
  WHEN param_cat_name='alcohol' AND param_cat_value=0 \
  THEN 'Abstinent' \
  WHEN param_cat_name='alcohol' AND param_cat_value=1 \
  THEN 'Moderate Drinker' \
  WHEN param_cat_name='alcohol' AND param_cat_value=2 \
  THEN 'Heavy Drinker' \
  WHEN param_cat_name='vaccination_flu' AND param_cat_value=0 \
  THEN 'No' \
  WHEN param_cat_name='vaccination_flu' AND param_cat_value=1 \
  THEN 'Yes' \
  WHEN param_cat_name='working_status' AND param_cat_value=0 \
  THEN 'Active' \
  WHEN param_cat_name='working_status' AND param_cat_value=1 \
  THEN 'Unemployed' \
  WHEN param_cat_name='working_status' AND param_cat_value=2 \
  THEN 'Pensionist' \
  ELSE '' END AS param_cat_value_str
  FROM param_cat_incidents_predx_last_view WHERE param_cat_name!='vaccination_covid'")
param_cat_list <- sort(unique(param_cat_inc$param_cat_name))
dependent="sexo"


param_inc2unnest <- param_inc %>% 
  pivot_wider( id_cols=patient_id,names_from=param_name,values_from=param_value)%>%
  left_join( patient_inc %>% select(patient_id, sexo), by = "patient_id")

param_cat_inc2unnest <- param_cat_inc %>% 
  pivot_wider( id_cols=patient_id,names_from=param_cat_name,values_from=param_cat_value_str)%>%
  left_join( patient_inc %>% select(patient_id, sexo), by = "patient_id") 

param_inc_tab <- rbind(
  param_inc2unnest %>%
    summary_factorlist(dependent ,
                       param_list,
                       total_col = TRUE,cont="mean",na_include = TRUE,
                       na_to_prop =  FALSE,add_row_totals = TRUE,
                       include_row_missing_col = TRUE,
                       include_col_totals_percent   =FALSE,add_col_totals = TRUE),
  param_cat_inc2unnest %>%
    summary_factorlist(dependent ,
                       param_cat_list,
                       total_col = TRUE,cont="mean",na_include = TRUE,
                       na_to_prop =  FALSE,cont_cut=8,add_row_totals = TRUE,
                       include_row_missing_col = TRUE))
kable(param_inc_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))
Table 2: Demographic and socioeconomic characteristics of incidents patients
label Total N Missing N levels Female Male Total
Total N 2728 2516 5244
alb 2664 (50.8) 2580 Mean (SD) 4.1 (0.5) 4.0 (0.5) 4.0 (0.5)
bmi 2684 (51.2) 2560 Mean (SD) 36.1 (10.1) 36.0 (10.0) 36.1 (10.1)
col 2652 (50.6) 2592 Mean (SD) 206.4 (60.4) 214.2 (63.1) 210.3 (61.8)
creat 2756 (52.6) 2488 Mean (SD) 1.7 (0.7) 1.7 (0.8) 1.7 (0.7)
dbp 2576 (49.1) 2668 Mean (SD) 77.5 (15.9) 78.0 (15.6) 77.7 (15.7)
filtglom 2660 (50.7) 2584 Mean (SD) 68.4 (30.1) 68.2 (30.0) 68.4 (30.0)
hba1c 2716 (51.8) 2528 Mean (SD) 8.3 (2.0) 8.6 (2.0) 8.5 (2.0)
hdl 2664 (50.8) 2580 Mean (SD) 57.4 (21.2) 59.9 (21.6) 58.7 (21.4)
height 2728 (52.0) 2516 Mean (SD) 1.7 (0.2) 1.7 (0.2) 1.7 (0.2)
indalbcr 2624 (50.0) 2620 Mean (SD) 591.0 (345.9) 566.3 (330.6) 579.0 (338.7)
ldl 2692 (51.3) 2552 Mean (SD) 126.1 (51.5) 127.4 (50.8) 126.7 (51.2)
sbp 2672 (51.0) 2572 Mean (SD) 140.2 (27.5) 139.5 (27.6) 139.8 (27.6)
trgl 2676 (51.0) 2568 Mean (SD) 240.2 (115.7) 239.3 (111.1) 239.8 (113.5)
weight 2660 (50.7) 2584 Mean (SD) 122.0 (46.3) 123.9 (46.7) 122.9 (46.5)
alcohol 1588 (35.0) 2944 Abstinent 272 (32.1) 244 (33.0) 516 (32.5)
Heavy Drinker 312 (36.8) 252 (34.1) 564 (35.5)
Moderate Drinker 264 (31.1) 244 (33.0) 508 (32.0)
family_history_CVD 1816 (40.1) 2716 972 (100.0) 844 (100.0) 1816 (100.0)
physical_activity 1668 (36.8) 2864 Active 264 (29.3) 204 (26.6) 468 (28.1)
Inactive 152 (16.9) 192 (25.0) 344 (20.6)
Incapacity 248 (27.6) 208 (27.1) 456 (27.3)
Partially Active 236 (26.2) 164 (21.4) 400 (24.0)
smoking_status 1748 (38.6) 2784 Ex-smoker < 1 year 188 (20.4) 164 (19.8) 352 (20.1)
Ex-smoker >= 1 year 260 (28.3) 232 (28.0) 492 (28.1)
Non Smoker 204 (22.2) 180 (21.7) 384 (22.0)
Smoker 268 (29.1) 252 (30.4) 520 (29.7)
vaccination_flu 1724 (38.0) 2808 Yes 940 (100.0) 784 (100.0) 1724 (100.0)
working_status 1616 (35.7) 2916 Active 328 (38.9) 256 (33.2) 584 (36.1)
Pensionist 244 (28.9) 328 (42.5) 572 (35.4)
Unemployed 272 (32.2) 188 (24.4) 460 (28.5)
Code
patient_inc$deregistration_date <- patient_inc$deregistration_date %>%
  replace_na(as.Date('2023-01-01'))
patient_inc <- patient_inc %>% 
  mutate(follow_up_time=as.numeric(difftime(deregistration_date,
                                            dx_date,units="days")/365.25)) %>%
  filter(follow_up_time!=0)

ss_use_inc <- dbGetQuery(con," SELECT * FROM main.ss_use \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date >='2017-01-01') AND \
                  visit_date>= (SELECT dx_date FROM main.patient WHERE \
                  main.patient.patient_id = main.ss_use.patient_id)") %>% mutate(
                        visit_type=factor(visit_type, levels=c(1,2,3,4,5,6,7,8),
                        labels=c("PC_physician", "PC_nurse", "PC_social_worker",
                                 "PC_emergency_service","PC_others",
                                 "Specialized_visit_physician",
                                 "Specialized_visit_nurse",
                                 "Specialized_visit_unknown_professional")))

ss_use_inc1 <- ss_use_inc %>% 
  filter(visit_loc == 1)
use_inc_freq1 <- as.data.frame.matrix(table(ss_use_inc1$patient_id,ss_use_inc1$visit_type))
use_inc_freq1 <- cbind(patient_id = rownames(use_inc_freq1), use_inc_freq1)
use_inc_freq1_time <- left_join(use_inc_freq1, patient_inc %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
  mutate(PC_physician= PC_physician/follow_up_time,
         PC_nurse=PC_nurse/follow_up_time,
         PC_social_worker=PC_social_worker/follow_up_time,
         PC_emergency_service=PC_emergency_service/follow_up_time,
         PC_others=PC_others/follow_up_time,
         Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
         Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
         Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
         sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female"))) 

ss_use_inc2 <- ss_use_inc %>% 
  filter(visit_loc == 2)
use_inc_freq2 <- as.data.frame.matrix(table(ss_use_inc2$patient_id,ss_use_inc2$visit_type))
use_inc_freq2 <- cbind(patient_id = rownames(use_inc_freq2), use_inc_freq2)
use_inc_freq2_time <- left_join(use_inc_freq2, patient_inc %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
  mutate(PC_physician= PC_physician/follow_up_time,
         PC_nurse=PC_nurse/follow_up_time,
         PC_social_worker=PC_social_worker/follow_up_time,
         PC_emergency_service=PC_emergency_service/follow_up_time,
         PC_others=PC_others/follow_up_time,
         Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
         Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
         Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
         sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female"))) 

ss_use_inc3 <- ss_use_inc %>% 
  filter(visit_loc == 3)
use_inc_freq3 <- as.data.frame.matrix(table(ss_use_inc3$patient_id,ss_use_inc3$visit_type))
use_inc_freq3 <- cbind(patient_id = rownames(use_inc_freq3), use_inc_freq3)
use_inc_freq3_time <- left_join(use_inc_freq3, patient_inc %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
  mutate(PC_physician= PC_physician/follow_up_time,
         PC_nurse=PC_nurse/follow_up_time,
         PC_social_worker=PC_social_worker/follow_up_time,
         PC_emergency_service=PC_emergency_service/follow_up_time,
         PC_others=PC_others/follow_up_time,
         Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
         Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
         Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
         sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female"))) 


use_inc_tab <- rbind(
  rbind(data.frame(label = "at care center", levels = '',
                   Male = '', Female = '', Total = ''),
        use_inc_freq1_time %>%
          summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
                                          "PC_emergency_service","PC_others",
                                          "Specialized_visit_physician",
                                          "Specialized_visit_nurse",
                                          "Specialized_visit_unknown_professional"),
                             total_col = TRUE,cont="mean",cont_cut=1,
                             na_include = TRUE,na_to_prop =  FALSE)),
  rbind(data.frame(label = "at home", levels = '',
                   Male = '', Female = '', Total = ''),
        use_inc_freq2_time %>%
          summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
                                          "PC_emergency_service","PC_others",
                                          "Specialized_visit_physician",
                                          "Specialized_visit_nurse",
                                          "Specialized_visit_unknown_professional"),
                             total_col = TRUE,cont="mean",cont_cut=1,
                             na_include = TRUE,na_to_prop =  FALSE)),
  rbind(data.frame(label = "by telephone or similar", levels = '',
                   Male = '', Female = '', Total = ''),
        use_inc_freq3_time %>%
          summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
                                          "PC_emergency_service","PC_others",
                                          "Specialized_visit_physician",
                                          "Specialized_visit_nurse",
                                          "Specialized_visit_unknown_professional"),
                             total_col = TRUE,cont="mean",cont_cut=1,
                             na_include = TRUE,na_to_prop =  FALSE)))

kable(use_inc_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))
Table 3: Use of Primary Care Services per year of incidents patients
label levels Male Female Total
at care center
PC_physician Mean (SD) 0.6 (5.8) 0.6 (4.6) 0.6 (5.2)
PC_nurse Mean (SD) 1.3 (17.6) 0.5 (5.5) 0.9 (12.9)
PC_social_worker Mean (SD) 0.6 (6.0) 0.3 (1.3) 0.5 (4.2)
PC_emergency_service Mean (SD) 0.6 (8.4) 0.4 (2.7) 0.5 (6.1)
PC_others Mean (SD) 0.4 (2.1) 0.3 (1.5) 0.3 (1.8)
Specialized_visit_physician Mean (SD) 0.4 (2.1) 0.4 (2.3) 0.4 (2.2)
Specialized_visit_nurse Mean (SD) 0.4 (2.4) 0.3 (1.3) 0.3 (1.9)
Specialized_visit_unknown_professional Mean (SD) 1.0 (16.7) 0.4 (2.6) 0.7 (11.7)
at home
PC_physician Mean (SD) 0.6 (3.8) 0.2 (1.2) 0.4 (2.8)
PC_nurse Mean (SD) 1.0 (17.3) 0.4 (1.5) 0.7 (11.9)
PC_social_worker Mean (SD) 0.5 (5.9) 0.3 (1.5) 0.4 (4.2)
PC_emergency_service Mean (SD) 0.7 (8.7) 0.8 (6.6) 0.7 (7.6)
PC_others Mean (SD) 0.3 (1.0) 0.5 (3.1) 0.4 (2.4)
Specialized_visit_physician Mean (SD) 0.2 (1.1) 0.9 (7.0) 0.6 (5.1)
Specialized_visit_nurse Mean (SD) 1.1 (17.3) 0.4 (1.9) 0.7 (11.9)
Specialized_visit_unknown_professional Mean (SD) 0.3 (1.4) 0.6 (4.8) 0.5 (3.7)
by telephone or similar
PC_physician Mean (SD) 0.4 (3.0) 0.4 (2.0) 0.4 (2.5)
PC_nurse Mean (SD) 0.3 (0.8) 0.6 (3.3) 0.4 (2.4)
PC_social_worker Mean (SD) 1.1 (17.2) 0.5 (3.4) 0.8 (12.1)
PC_emergency_service Mean (SD) 0.4 (2.7) 0.6 (2.5) 0.5 (2.6)
PC_others Mean (SD) 0.7 (6.4) 0.5 (5.6) 0.6 (6.0)
Specialized_visit_physician Mean (SD) 0.3 (2.1) 0.3 (1.3) 0.3 (1.7)
Specialized_visit_nurse Mean (SD) 1.1 (17.2) 1.0 (10.2) 1.1 (14.0)
Specialized_visit_unknown_professional Mean (SD) 0.5 (5.8) 0.5 (5.6) 0.5 (5.7)
Code
patient_condition <- dbGetQuery(con,
      "SELECT patient_id,type AS initial_status,code, \
      CASE WHEN code='end' THEN type \
      WHEN code='bmi<30' THEN 'else' \
      WHEN code IN ('bmi>=30','E66.01','E66.09','E66.1','E66.2','E66.8','E66.9') \
      THEN 'ob' \
      WHEN code='f_date' THEN 'f' \
      WHEN code IN ('N18','N18.1','N18.2','N18.3','N18.30','N18.31','N18.32',
      'N18.4','N18.5','N18.6','N18.9','I12','I12.0','I12.9','filtglom') THEN 'ckd' \
      WHEN code IN ('I09.81', 'I11.0', 'I13.0', 'I13.2', 'I50', 'I50.0',
      'I50.1', 'I50.9') THEN 'hf' \
      WHEN code='death' THEN code \
      ELSE 'cvd' END AS final_status, \
      DATEDIFF('day',dx_date,date) AS duration \
      FROM (SELECT pc.*,pi.dx_date \
      FROM patient_condition pc LEFT JOIN patient_incidents_view pi ON \
      pc.patient_id=pi.patient_id)") 

transition_sum <- patient_condition %>%
  group_by(initial_status, final_status) %>%
  summarise(
    total_N = n(),  # Number of repetitions for each transition
    first_quartile = quantile(duration, 0.25),  # First quantile of duration
    flow_time = quantile(duration, 0.5),  # Median
    third_quartile = quantile(duration, 0.75)   # Third quantile of duration
  ) %>% 
  group_by(initial_status) %>%
  mutate(transition_percentage = total_N / sum(total_N) * 100) %>%
  arrange(initial_status, desc(total_N)) %>%
  select(all_of(c("initial_status","final_status","total_N","transition_percentage",
                  "first_quartile","flow_time","third_quartile"))) 
transition_sum$initial_status = factor(transition_sum$initial_status,
                                       levels = c('cvd','hf','ckd','f','ob','else'))
transition_sum$final_status = factor(transition_sum$final_status,
                                levels = c('cvd','hf','ckd','f','ob','else','death'))

process_matrix <- ggplot(
  transition_sum, aes(x = final_status, y = initial_status)) +
  geom_tile(aes(fill = flow_time), color = "white") +
  geom_text(aes(label = total_N), vjust = 1) +
  scale_fill_distiller() +
  labs(title = "Predominant clinical conditions' changes",
       x = "Final Status",
       y = "Initial Status")+
  theme_minimal() +  # Apply a minimal theme for a cleaner look
  theme(axis.text = element_text(size = 12), 
        axis.title = element_text(size = 14, face = "bold"))  # Adjust text sizes

ggsave(gsub("src/analysis-scripts/","",here("outputs/predominant_condition_matrix.png")),plot=process_matrix)

Figure 12: Predominant clinical condition’s first change in incident patients. The number of the boxes shows the number of people experienced the change and the colour of the box the number of days needed to change.
Code
patient_pre <- dbGetQuery(con,
                          "SELECT *,
                       CASE WHEN sex = 0 THEN 'Male' 
                       ELSE 'Female' END AS sexo,
                       FLOOR(DATEDIFF('day',month_nac,DATE '2017-01-01') / 365.25) 
                       AS 'age',
                       FLOOR(DATEDIFF('day',month_nac, dx_date) / 365.25) 
                       AS 'age_dx'FROM main.patient 
                       WHERE dx_date < '2017-01-01'") 
patient_pre <- patient_pre %>%
  left_join( country2cont %>% select(country, CC), by = "country") %>%
  mutate(education=factor(education, levels=c(0,1,2,3),
      labels=c("Without studies","Primary school",
               "High school","University")),
      copayment=factor(copayment,levels=c(0,1),
                       labels=c("less than 18000","more or equal than 18000")))
patient_pre$CC[patient_pre$country == 724] <- 'Spain'
dependent="sexo"
explanatory=c("age","age_dx","CC","copayment","education", "avg_income")  

patient_pre%>%summary_factorlist(dependent,
                                 explanatory,
                                 total_col = TRUE,
                                 cont="mean",
                                 cont_cut=7,
                                 na_include = TRUE,
                                 na_to_prop =  FALSE, 
                                 include_col_totals_percent =FALSE,
                                 add_col_totals = TRUE)-> patient_pre_tab
kable(patient_pre_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))
Table 4: Demographic and socioeconomic characteristics of prevalents patients
label levels Female Male Total
Total N 7568 7188 14756
age Mean (SD) 78.6 (13.3) 78.1 (13.4) 78.4 (13.3)
age_dx Mean (SD) 75.6 (13.4) 75.1 (13.5) 75.3 (13.4)
CC AS 3784 (50.0) 3594 (50.0) 7378 (50.0)
EU 1892 (25.0) 1797 (25.0) 3689 (25.0)
OC 1892 (25.0) 1797 (25.0) 3689 (25.0)
copayment less than 18000 2476 (50.0) 2476 (50.4) 4952 (50.2)
more or equal than 18000 2480 (50.0) 2436 (49.6) 4916 (49.8)
(Missing) 2612 2276 4888
education Without studies 0 (NaN) 0 (NaN) 0 (NaN)
Primary school 0 (NaN) 0 (NaN) 0 (NaN)
High school 0 (NaN) 0 (NaN) 0 (NaN)
University 0 (NaN) 0 (NaN) 0 (NaN)
(Missing) 7568 7188 14756
avg_income (Missing) 7568 7188 14756
Code
param_pre <- dbGetQuery(con,"SELECT * FROM param_prevalents_pre_last_view")
param_list <- sort(unique(param_pre$param_name))
param_cat_pre <- dbGetQuery(con,
  "SELECT *, \
  CASE WHEN param_cat_name='physical_activity' AND param_cat_value=0 \
  THEN 'Incapacity' \
  WHEN param_cat_name='physical_activity' AND param_cat_value=1 \
  THEN 'Inactive' \
  WHEN param_cat_name='physical_activity' AND param_cat_value=2 \
  THEN 'Partially Active' \
  WHEN param_cat_name='physical_activity' AND param_cat_value=3 \
  THEN 'Active' \
  WHEN param_cat_name='smoking_status' AND param_cat_value=0 \
  THEN 'Non Smoker' \
  WHEN param_cat_name='smoking_status' AND param_cat_value=1 \
  THEN 'Ex-smoker < 1 year' \
  WHEN param_cat_name='smoking_status' AND param_cat_value=2 \
  THEN 'Ex-smoker >= 1 year' \
  WHEN param_cat_name='smoking_status' AND param_cat_value=3 \
  THEN 'Smoker' \
  WHEN param_cat_name='alcohol' AND param_cat_value=0 \
  THEN 'Abstinent' \
  WHEN param_cat_name='alcohol' AND param_cat_value=1 \
  THEN 'Moderate Drinker' \
  WHEN param_cat_name='alcohol' AND param_cat_value=2 \
  THEN 'Heavy Drinker' \
  WHEN param_cat_name='vaccination_flu' AND param_cat_value=0 \
  THEN 'No' \
  WHEN param_cat_name='vaccination_flu' AND param_cat_value=1 \
  THEN 'Yes' \
  WHEN param_cat_name='working_status' AND param_cat_value=0 \
  THEN 'Active' \
  WHEN param_cat_name='working_status' AND param_cat_value=1 \
  THEN 'Unemployed' \
  WHEN param_cat_name='working_status' AND param_cat_value=2 \
  THEN 'Pensionist' \
  ELSE '' END AS param_cat_value_str
  FROM param_cat_prevalents_pre_last_view WHERE param_cat_name!='vaccination_covid'")
param_cat_list <- sort(unique(param_cat_pre$param_cat_name))
dependent="sexo"


param_pre2unnest <- param_pre %>% 
  pivot_wider( id_cols=patient_id,names_from=param_name,values_from=param_value)%>%
  left_join( patient_pre %>% select(patient_id, sexo), by = "patient_id")

param_cat_pre2unnest <- param_cat_pre %>% 
  pivot_wider( id_cols=patient_id,names_from=param_cat_name,values_from=param_cat_value_str)%>%
  left_join( patient_pre %>% select(patient_id, sexo), by = "patient_id") 

param_pre_tab <- rbind(
  param_pre2unnest %>%
    summary_factorlist(dependent ,
                       param_list,
                       total_col = TRUE,cont="mean",na_include = TRUE,
                       na_to_prop =  FALSE,add_row_totals = TRUE,
                       include_row_missing_col = TRUE,
                       include_col_totals_percent   =FALSE,add_col_totals = TRUE),
  param_cat_pre2unnest %>%
    summary_factorlist(dependent ,
                       param_cat_list,
                       total_col = TRUE,cont="mean",na_include = TRUE,
                       na_to_prop =  FALSE,cont_cut=8,add_row_totals = TRUE,
                       include_row_missing_col = TRUE))
kable(param_pre_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))
Table 5: Clinical and lifestyle characteristics of prevalent patients
label Total N Missing N levels Female Male Total
Total N 7368 7012 14380
alb 3472 (24.1) 10908 Mean (SD) 4.0 (0.5) 4.1 (0.6) 4.1 (0.5)
bmi 3528 (24.5) 10852 Mean (SD) 35.6 (9.5) 34.7 (9.5) 35.2 (9.5)
col 3548 (24.7) 10832 Mean (SD) 203.9 (60.4) 203.0 (62.8) 203.4 (61.6)
creat 3680 (25.6) 10700 Mean (SD) 1.7 (0.8) 1.7 (0.8) 1.7 (0.8)
dbp 3648 (25.4) 10732 Mean (SD) 79.2 (16.0) 77.2 (15.8) 78.2 (15.9)
filtglom 3660 (25.5) 10720 Mean (SD) 67.0 (31.2) 68.7 (29.9) 67.8 (30.6)
hba1c 3352 (23.3) 11028 Mean (SD) 8.5 (2.1) 8.6 (2.0) 8.5 (2.1)
hdl 3468 (24.1) 10912 Mean (SD) 59.2 (21.8) 58.1 (22.2) 58.6 (22.0)
height 3628 (25.2) 10752 Mean (SD) 1.7 (0.2) 1.7 (0.2) 1.7 (0.2)
indalbcr 3664 (25.5) 10716 Mean (SD) 567.0 (341.6) 578.8 (341.4) 572.9 (341.5)
ldl 3640 (25.3) 10740 Mean (SD) 124.1 (51.1) 122.9 (50.1) 123.6 (50.6)
sbp 3640 (25.3) 10740 Mean (SD) 144.2 (27.5) 140.7 (27.0) 142.6 (27.3)
trgl 3544 (24.6) 10836 Mean (SD) 244.0 (120.0) 242.4 (111.9) 243.2 (116.0)
weight 3476 (24.2) 10904 Mean (SD) 118.6 (48.1) 111.8 (47.1) 115.3 (47.7)
alcohol 1968 (22.8) 6648 Abstinent 380 (35.7) 312 (34.5) 692 (35.2)
Heavy Drinker 356 (33.5) 288 (31.9) 644 (32.7)
Moderate Drinker 328 (30.8) 304 (33.6) 632 (32.1)
family_history_CVD 2112 (24.5) 6504 1104 (100.0) 1008 (100.0) 2112 (100.0)
physical_activity 1936 (22.5) 6680 Active 248 (24.9) 260 (27.7) 508 (26.2)
Inactive 216 (21.7) 196 (20.9) 412 (21.3)
Incapacity 268 (26.9) 240 (25.5) 508 (26.2)
Partially Active 264 (26.5) 244 (26.0) 508 (26.2)
smoking_status 1904 (22.1) 6712 Ex-smoker < 1 year 296 (28.9) 208 (23.6) 504 (26.5)
Ex-smoker >= 1 year 284 (27.7) 232 (26.4) 516 (27.1)
Non Smoker 216 (21.1) 176 (20.0) 392 (20.6)
Smoker 228 (22.3) 264 (30.0) 492 (25.8)
vaccination_flu 2004 (23.3) 6612 Yes 1120 (100.0) 884 (100.0) 2004 (100.0)
working_status 1924 (22.3) 6692 Active 348 (34.5) 264 (28.8) 612 (31.8)
Pensionist 336 (33.3) 336 (36.7) 672 (34.9)
Unemployed 324 (32.1) 316 (34.5) 640 (33.3)
Code
patient_pre$deregistration_date <- patient_pre$deregistration_date %>%
  replace_na(as.Date('2023-01-01'))
patient_pre <- patient_pre %>% 
  mutate(follow_up_time=as.numeric(difftime(deregistration_date,                                          as.Date('2016-12-31'),units="days")/365.25))%>%
  filter(follow_up_time!=0)



ss_use_pre <- dbGetQuery(con," SELECT * FROM main.ss_use \
            WHERE patient_id IN (SELECT patient_id FROM main.patient \
                                 WHERE dx_date <'2017-01-01') AND
                         visit_date>='2017-01-01'") %>% mutate(
                           visit_type=factor(visit_type, levels=c(1,2,3,4,5,6,7,8),
                          labels=c("PC_physician", "PC_nurse", "PC_social_worker",
                                   "PC_emergency_service","PC_others",
                                   "Specialized_visit_physician",
                                   "Specialized_visit_nurse",
                                   "Specialized_visit_unknown_professional"))
                         )

ss_use_pre1 <- ss_use_pre %>% 
  filter(visit_loc == 1)
use_pre_freq1 <- as.data.frame.matrix(table(ss_use_pre1$patient_id,ss_use_pre1$visit_type))
use_pre_freq1 <- cbind(patient_id = rownames(use_pre_freq1), use_pre_freq1)
use_pre_freq1_time <- left_join(use_pre_freq1, patient_pre %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
  mutate(PC_physician= PC_physician/follow_up_time,
         PC_nurse=PC_nurse/follow_up_time,
         PC_social_worker=PC_social_worker/follow_up_time,
         PC_emergency_service=PC_emergency_service/follow_up_time,
         PC_others=PC_others/follow_up_time,
         Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
         Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
         Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
         sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female"))) 

ss_use_pre2 <- ss_use_pre %>% 
  filter(visit_loc == 2)
use_pre_freq2 <- as.data.frame.matrix(table(ss_use_pre2$patient_id,ss_use_pre2$visit_type))
use_pre_freq2 <- cbind(patient_id = rownames(use_pre_freq2), use_pre_freq2)
use_pre_freq2_time <- left_join(use_pre_freq2, patient_pre %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
  mutate(PC_physician= PC_physician/follow_up_time,
         PC_nurse=PC_nurse/follow_up_time,
         PC_social_worker=PC_social_worker/follow_up_time,
         PC_emergency_service=PC_emergency_service/follow_up_time,
         PC_others=PC_others/follow_up_time,
         Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
         Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
         Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
         sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female"))) 

ss_use_pre3 <- ss_use_pre %>% 
  filter(visit_loc == 3)
use_pre_freq3 <- as.data.frame.matrix(table(ss_use_pre3$patient_id,ss_use_pre3$visit_type))
use_pre_freq3 <- cbind(patient_id = rownames(use_pre_freq3), use_pre_freq3)
use_pre_freq3_time <- left_join(use_pre_freq3, patient_pre %>% select(patient_id, sex, follow_up_time,dx_date), by = "patient_id") %>%
  mutate(PC_physician= PC_physician/follow_up_time,
         PC_nurse=PC_nurse/follow_up_time,
         PC_social_worker=PC_social_worker/follow_up_time,
         PC_emergency_service=PC_emergency_service/follow_up_time,
         PC_others=PC_others/follow_up_time,
         Specialized_visit_physician=Specialized_visit_physician/follow_up_time,
         Specialized_visit_nurse=Specialized_visit_nurse/follow_up_time,
         Specialized_visit_unknown_professional=Specialized_visit_unknown_professional/follow_up_time,
         sexo=factor(sex, levels=c(0,1), labels=c("Male", "Female"))) 


use_pre_tab <- rbind(
  rbind(data.frame(label = "at care center", levels = '',
                   Male = '', Female = '', Total = ''),
        use_pre_freq1_time %>%
    summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
                                    "PC_emergency_service","PC_others",
                                    "Specialized_visit_physician",
                                    "Specialized_visit_nurse",
                                    "Specialized_visit_unknown_professional"),
                       total_col = TRUE,cont="mean",cont_cut=1,
                       na_include = TRUE,na_to_prop =  FALSE)),
    rbind(data.frame(label = "at home", levels = '',
                     Male = '', Female = '', Total = ''),
        use_pre_freq2_time %>%
    summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
                                    "PC_emergency_service","PC_others",
                                    "Specialized_visit_physician",
                                    "Specialized_visit_nurse",
                                    "Specialized_visit_unknown_professional"),
                        total_col = TRUE,cont="mean",cont_cut=1,
                        na_include = TRUE,na_to_prop =  FALSE)),
    rbind(data.frame(label = "by telephone or similar", levels = '',
                     Male = '', Female = '', Total = ''),
        use_pre_freq3_time %>%
    summary_factorlist(dependent ,c("PC_physician", "PC_nurse", "PC_social_worker",
                                    "PC_emergency_service","PC_others",
                                    "Specialized_visit_physician",
                                    "Specialized_visit_nurse",
                                    "Specialized_visit_unknown_professional"),
                        total_col = TRUE,cont="mean",cont_cut=1,
                        na_include = TRUE,na_to_prop =  FALSE)))

kable(use_pre_tab, row.names=FALSE, align=c("l", "l", rep("r",5)))
Table 6: Use of Primary Care Services of prevalent patients
label levels Male Female Total
at care center
PC_physician Mean (SD) 0.1 (2.3) -0.0 (2.8) 0.0 (2.6)
PC_nurse Mean (SD) 0.2 (6.8) -0.1 (5.6) 0.1 (6.2)
PC_social_worker Mean (SD) -0.3 (9.9) 0.1 (5.2) -0.1 (7.8)
PC_emergency_service Mean (SD) -0.1 (3.3) 0.0 (2.2) -0.1 (2.8)
PC_others Mean (SD) -0.1 (3.8) 0.2 (5.7) 0.0 (4.9)
Specialized_visit_physician Mean (SD) -0.1 (4.8) 0.2 (12.2) 0.0 (9.3)
Specialized_visit_nurse Mean (SD) -0.0 (13.3) 0.1 (5.4) 0.0 (10.0)
Specialized_visit_unknown_professional Mean (SD) -0.1 (2.8) 0.3 (9.4) 0.1 (7.0)
at home
PC_physician Mean (SD) -0.2 (10.2) 0.1 (4.2) -0.0 (7.7)
PC_nurse Mean (SD) 0.4 (11.3) 0.0 (6.6) 0.2 (9.2)
PC_social_worker Mean (SD) -0.1 (4.5) 0.1 (5.8) 0.0 (5.2)
PC_emergency_service Mean (SD) -0.1 (3.3) 0.1 (4.1) 0.0 (3.7)
PC_others Mean (SD) 0.1 (4.5) 0.1 (1.8) 0.1 (3.4)
Specialized_visit_physician Mean (SD) -0.1 (4.0) 0.3 (10.2) 0.1 (7.8)
Specialized_visit_nurse Mean (SD) 0.3 (10.0) 0.0 (7.6) 0.2 (8.8)
Specialized_visit_unknown_professional Mean (SD) -0.2 (3.6) 0.4 (11.1) 0.1 (8.3)
by telephone or similar
PC_physician Mean (SD) -0.0 (3.2) 0.0 (4.8) -0.0 (4.1)
PC_nurse Mean (SD) -0.2 (13.9) -0.1 (5.2) -0.1 (10.4)
PC_social_worker Mean (SD) 0.1 (10.0) 0.3 (10.2) 0.2 (10.1)
PC_emergency_service Mean (SD) -0.0 (4.8) -0.2 (9.9) -0.1 (7.8)
PC_others Mean (SD) -0.1 (3.9) -0.0 (5.5) -0.0 (4.7)
Specialized_visit_physician Mean (SD) -0.4 (11.5) 0.1 (3.5) -0.1 (8.4)
Specialized_visit_nurse Mean (SD) 0.1 (10.0) 0.2 (6.2) 0.2 (8.3)
Specialized_visit_unknown_professional Mean (SD) -0.1 (4.4) 0.0 (6.0) -0.0 (5.3)

Patients without predominant clinical condition

Event log’s creation and description

Choosing patients without any predominant clinical condition and after some preprocessing we obtain an event log that can be represented in the below process map Figure 13. There is shown how the events are connected, the mean number of days patients spent in each event (or treatment), percentage of patients who made each transition and the mean number of days it took to make the transition. However, a spaghetti map is obtained and nothing can be concluded. Therefore, we have to simplify the process map and for example only show the most frequent traces covering 20% of the event log as in Figure 14. Moreover, in Figure 15 we show percentage of patients’ traces each activity is present.

Figure 13: Event log’s process maps with all traces

Figure 14: Event log’s process maps with most frequent traces covering 20%

Figure 15: Percentage of patients’ traces an activity is present

Clustering traces

Once the set of traces to analyze are selected, there is a need to choose a distance measure to clustering. In this example Levenshtein similarity is chosen to calculate the distance matrix.

When distance matrix is acquired, we are able to cluster. However, the number of clusters have to be fixed before. With these figures we are able to conclude what could be the optimal number of clusters.

Figure 16: Distance matrix’s dendogram

Figure 17: Distance matrix’s elbow method’s graphic

Choosing the optimal number of clusters, too small clusters can appear, but we can exclude those of less than 30 traces to avoid having clusters of low representation. The process map and the most frequent traces of one of these clusters that remain are the followings.

(a) 5 most frequent traces

(b) Process map covering 25% most frequent traces

Figure 18: Cluster 0

Conformace checking

Once traces are clusterized, with a boxplot is easy to show that each cluster’s behavior with respect to the treatment guides is different. Comparing traces with Figure 6, calculating the fitness (1 being perfect match with the petri net and 0 being the lowest possible fitness) of each trace and grouping by cluster results in_ Figure 19.

Figure 19: Traces fitness distribution by cluster

Prediction Models

In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependet Cox model, and the summary of it is:

Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness, 
    data = df)

  n= 702, number of events= 63 

            coef exp(coef) se(coef)      z Pr(>|z|)
age     -0.02058   0.97963  0.02745 -0.750    0.454
sex1    -0.40668   0.66586  0.26622 -1.528    0.127
fitness -0.28259   0.75383  0.91797 -0.308    0.758

        exp(coef) exp(-coef) lower .95 upper .95
age        0.9796      1.021    0.9283     1.034
sex1       0.6659      1.502    0.3952     1.122
fitness    0.7538      1.327    0.1247     4.557

Concordance= 0.55  (se = 0.044 )
Likelihood ratio test= 2.6  on 3 df,   p=0.5
Wald test            = 2.58  on 3 df,   p=0.5
Score (logrank) test = 2.6  on 3 df,   p=0.5

Using same predictive variables a joint latent class model of two classes is made:

Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks 
     fitted by maximum likelihood method 
 
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0, 
    subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~ 
        age + sex, hazard = "Weibull", hazardtype = "PH", data = df, 
    verbose = FALSE)
 
Statistical Model: 
     Dataset: df 
     Number of subjects: 160 
     Number of observations: 702 
     Number of latent classes: 2 
     Number of parameters: 14  
     Event 1: 
        Number of events: 63
        Proportional hazards over latent classes and 
        Weibull baseline risk function 
 
Iteration process: 
     Convergence criteria satisfied 
     Number of iterations:  1 
     Convergence criteria: parameters= 3e-16 
                         : likelihood= 6.2e-11 
                         : second derivatives= 3.2e-11 
 
Goodness-of-fit statistics: 
     maximum log-likelihood: 53.16  
     AIC: -78.31  
     BIC: -35.26  
     Score test statistic for CI assumption: 2.505 (p-value=0.2858) 
 
Maximum Likelihood Estimates: 
 
Fixed effects in the class-membership model:
(the class of reference is the last class) 

                     coef      Se    Wald p-value
intercept class1 -0.28349 0.35838  -0.791 0.42893

Parameters in the proportional hazard model:

                             coef      Se    Wald p-value
event1 +/-sqrt(Weibull1)  0.05883 0.04304   1.367 0.17169
event1 +/-sqrt(Weibull2)  1.15803 0.06954  16.653 0.00000
event1 SurvPH class1      1.90432 0.44799   4.251 0.00002
age                      -0.03056 0.03008  -1.016 0.30974
sex1                     -0.34676 0.29674  -1.169 0.24258

Fixed effects in the longitudinal model:

                     coef      Se    Wald p-value
intercept class1  0.85420 0.02527  33.809 0.00000
intercept class2  0.81311 0.01876  43.341 0.00000
t_0 class1       -0.00116 0.00010 -11.153 0.00000
t_0 class2       -0.00062 0.00004 -17.622 0.00000


Variance-covariance matrix of the random-effects:
          intercept t_0
intercept   0.01693    
t_0        -0.00002   0

                            coef      Se
Residual standard error  0.08451 0.00276

 
Posterior classification based on longitudinal and time-to-event data: 
  class1 class2
N  57.00 103.00
%  35.62  64.38
 
Posterior classification table: 
     --> mean of posterior probabilities in each class 
        prob1  prob2
class1 0.7979 0.2021
class2 0.2258 0.7742
 
Posterior probabilities above a threshold (%): 
         class1 class2
prob>0.7  77.19  52.43
prob>0.8  54.39  45.63
prob>0.9  22.81  39.81
 
 
Posterior classification based only on longitudinal data: 
  class1 class2
N  54.00 106.00
%  33.75  66.25
 
png 
  2 

Figure 20: Class-specific weighted marginal and subject-specific mean predicted fitness trajectories

Figure 21: Class-specific mean predicted fitness trajectory

Figure 22: Class-specific survival functions

CV Disease

Event log’s creation and description

Choosing patients with a cardiovascular disease and after some preprocessing we obtain a new event log to make the same analysis it has done to patients without any predominant clinical condition.

Figure 23: Event log’s process maps with all traces

Figure 24: Event log’s process maps with most frequent traces covering 20%

Figure 25: Percentage of patients’ traces an activity is present

Clustering traces

Figure 26: Distance matrix’s dendogram

Figure 27: Distance matrix’s elbow method’s graphic

(a) 5 most frequent traces

(b) Process map covering 25% most frequent traces

Figure 28: Cluster 0

Conformace checking

Figure 29: Traces fitness distribution by cluster

Prediction Models

In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependet Cox model, and the summary of it is:

Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness, 
    data = df)

  n= 276, number of events= 41 

             coef exp(coef)  se(coef)      z Pr(>|z|)
age     -0.005415  0.994600  0.012044 -0.450    0.653
sex1    -0.446039  0.640159  0.321092 -1.389    0.165
fitness -0.699740  0.496714  1.491856 -0.469    0.639

        exp(coef) exp(-coef) lower .95 upper .95
age        0.9946      1.005   0.97140     1.018
sex1       0.6402      1.562   0.34117     1.201
fitness    0.4967      2.013   0.02668     9.246

Concordance= 0.589  (se = 0.05 )
Likelihood ratio test= 2.1  on 3 df,   p=0.6
Wald test            = 2.08  on 3 df,   p=0.6
Score (logrank) test = 2.11  on 3 df,   p=0.6

Using same predictive variables a joint latent class model of two classes is made:

Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks 
     fitted by maximum likelihood method 
 
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0, 
    subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~ 
        age + sex, hazard = "Weibull", hazardtype = "PH", data = df, 
    verbose = FALSE)
 
Statistical Model: 
     Dataset: df 
     Number of subjects: 72 
     Number of observations: 276 
     Number of latent classes: 2 
     Number of parameters: 14  
     Event 1: 
        Number of events: 41
        Proportional hazards over latent classes and 
        Weibull baseline risk function 
 
Iteration process: 
     Convergence criteria satisfied 
     Number of iterations:  2 
     Convergence criteria: parameters= 1.2e-09 
                         : likelihood= 4.5e-06 
                         : second derivatives= 1.7e-06 
 
Goodness-of-fit statistics: 
     maximum log-likelihood: 34.24  
     AIC: -40.47  
     BIC: -8.6  
     Score test statistic for CI assumption: 0.096 (p-value=0.9534) 
 
Maximum Likelihood Estimates: 
 
Fixed effects in the class-membership model:
(the class of reference is the last class) 

                     coef      Se   Wald p-value
intercept class1 -0.40879 0.48094 -0.850 0.39533

Parameters in the proportional hazard model:

                             coef      Se   Wald p-value
event1 +/-sqrt(Weibull1)  0.08098 0.04862  1.666 0.09578
event1 +/-sqrt(Weibull2)  1.03113 0.08180 12.605 0.00000
event1 SurvPH class1      2.01516 0.60864  3.311 0.00093
age                      -0.02223 0.01732 -1.283 0.19936
sex1                     -0.38288 0.39403 -0.972 0.33120

Fixed effects in the longitudinal model:

                     coef      Se   Wald p-value
intercept class1  0.56277 0.02326 24.190 0.00000
intercept class2  0.53931 0.01701 31.704 0.00000
t_0 class1       -0.00091 0.00023 -3.951 0.00008
t_0 class2       -0.00033 0.00004 -8.405 0.00000


Variance-covariance matrix of the random-effects:
          intercept t_0
intercept   0.00807    
t_0         0.00000   0

                            coef      Se
Residual standard error  0.05009 0.00292

 
Posterior classification based on longitudinal and time-to-event data: 
  class1 class2
N  29.00  43.00
%  40.28  59.72
 
Posterior classification table: 
     --> mean of posterior probabilities in each class 
        prob1  prob2
class1 0.7974 0.2026
class2 0.1306 0.8694
 
Posterior probabilities above a threshold (%): 
         class1 class2
prob>0.7  86.21  79.07
prob>0.8  58.62  65.12
prob>0.9  17.24  58.14
 
 
Posterior classification based only on longitudinal data: 
  class1 class2
N  19.00  53.00
%  26.39  73.61
 
png 
  2 

Figure 30: Class-specific weighted marginal and subject-specific mean predicted fitness trajectories

Figure 31: Class-specific mean predicted fitness trajectory

Figure 32: Class-specific survival functions

Heart Failure

Event log’s creation and description

Choosing patients with heart failure and after some preprocessing we obtain a new event log to make the same analysis it has done to patients without any predominant clinical condition.

Figure 33: Event log’s process maps with all traces

Figure 34: Event log’s process maps with most frequent traces covering 20%

Figure 35: Percentage of patients’ traces an activity is present

Clustering traces

Figure 36: Distance matrix’s dendogram

Figure 37: Distance matrix’s elbow method’s graphic

(a) 5 most frequent traces

(b) Process map covering 25% most frequent traces

Figure 38: Cluster 0

Conformace checking

Figure 39: Traces fitness distribution by cluster

Prediction Models

In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependet Cox model, and the summary of it is:

Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness, 
    data = df)

  n= 7, number of events= 1 

              coef  exp(coef)   se(coef) z Pr(>|z|)
age      3.995e-01  1.491e+00  1.551e+06 0        1
sex1    -3.196e+00  4.093e-02  1.241e+07 0        1
fitness  0.000e+00  1.000e+00  6.978e+08 0        1

        exp(coef) exp(-coef) lower .95 upper .95
age       1.49106     0.6707         0       Inf
sex1      0.04093    24.4318         0       Inf
fitness   1.00000     1.0000         0       Inf

Concordance= 1  (se = 0 )
Likelihood ratio test= 2.2  on 3 df,   p=0.5
Wald test            = 0  on 3 df,   p=1
Score (logrank) test = 2  on 3 df,   p=0.6

Using same predictive variables a joint latent class model of two classes is made:

Figure 40: Class-specific weighted marginal and subject-specific mean predicted fitness trajectories

Figure 41: Class-specific mean predicted fitness trajectory

Figure 42: Class-specific survival functions

Chronic kidney disease

Event log’s creation and description

Choosing patients with chronic kidney disease and after some preprocessing we obtain a new event log to make the same analysis it has done to patients without any predominant clinical condition.

Figure 43: Event log’s process maps with all traces

Figure 44: Event log’s process maps with most frequent traces covering 20%

Figure 45: Percentage of patients’ traces an activity is present

Clustering traces

Figure 46: Distance matrix’s dendogram

Figure 47: Distance matrix’s elbow method’s graphic

(a) 5 most frequent traces

(b) Process map covering 25% most frequent traces

Figure 48: Cluster 0

Conformace checking

Figure 49: Traces fitness distribution by cluster

Prediction Models

In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependet Cox model, and the summary of it is:

Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness, 
    data = df)

  n= 97, number of events= 15 

             coef exp(coef)  se(coef)      z Pr(>|z|)  
age      0.009163  1.009206  0.019369  0.473   0.6361  
sex1    -0.091847  0.912245  0.557819 -0.165   0.8692  
fitness -6.292425  0.001850  3.399764 -1.851   0.0642 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age       1.00921     0.9909 9.716e-01     1.048
sex1      0.91224     1.0962 3.057e-01     2.722
fitness   0.00185   540.4623 2.362e-06     1.449

Concordance= 0.702  (se = 0.073 )
Likelihood ratio test= 3.85  on 3 df,   p=0.3
Wald test            = 3.67  on 3 df,   p=0.3
Score (logrank) test = 3.83  on 3 df,   p=0.3

Using same predictive variables a joint latent class model of two classes is made:

Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks 
     fitted by maximum likelihood method 
 
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0, 
    subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~ 
        age + sex, hazard = "Weibull", hazardtype = "PH", data = df, 
    verbose = FALSE)
 
Statistical Model: 
     Dataset: df 
     Number of subjects: 26 
     Number of observations: 97 
     Number of latent classes: 2 
     Number of parameters: 14  
     Event 1: 
        Number of events: 15
        Proportional hazards over latent classes and 
        Weibull baseline risk function 
 
Iteration process: 
     Convergence criteria satisfied 
     Number of iterations:  1 
     Convergence criteria: parameters= 1.4e-11 
                         : likelihood= 1.8e-06 
                         : second derivatives= 8.7e-06 
 
Goodness-of-fit statistics: 
     maximum log-likelihood: 7.6  
     AIC: 12.8  
     BIC: 30.42  
 
 
Maximum Likelihood Estimates: 
 
Fixed effects in the class-membership model:
(the class of reference is the last class) 

                     coef      Se    Wald p-value
intercept class1 -0.00093 0.87912  -0.001 0.99916

Parameters in the proportional hazard model:

                             coef      Se    Wald p-value
event1 +/-sqrt(Weibull1)  0.02738 0.01728   1.585 0.11307
event1 +/-sqrt(Weibull2)  1.21071 0.17522   6.910 0.00000
event1 SurvPH class1      2.36080 1.05969   2.228 0.02589
age                       0.00952 0.02416   0.394 0.69370
sex1                     -0.76573 0.68034  -1.126 0.26038

Fixed effects in the longitudinal model:

                     coef      Se    Wald p-value
intercept class1  0.53595 0.05122  10.465 0.00000
intercept class2  0.61907 0.02639  23.463 0.00000
t_0 class1       -0.00080 0.00010  -7.832 0.00000
t_0 class2       -0.00046 0.00004 -11.727 0.00000


Variance-covariance matrix of the random-effects:
          intercept t_0
intercept   0.00439    
t_0         0.00000   0

                            coef      Se
Residual standard error  0.05782 0.00491

 
Posterior classification based on longitudinal and time-to-event data: 
  class1 class2
N  12.00  14.00
%  46.15  53.85
 
Posterior classification table: 
     --> mean of posterior probabilities in each class 
        prob1  prob2
class1 0.9251 0.0749
class2 0.1355 0.8645
 
Posterior probabilities above a threshold (%): 
         class1 class2
prob>0.7  91.67  85.71
prob>0.8  91.67  64.29
prob>0.9  75.00  50.00
 
 
Posterior classification based only on longitudinal data: 
  class1 class2
N     13     13
%     50     50
 
png 
  2 

Figure 50: Class-specific weighted marginal and subject-specific mean predicted fitness trajectories

Figure 51: Class-specific mean predicted fitness trajectory

Figure 52: Class-specific survival functions

Frailty

Event log’s creation and description

Choosing patients with frailty and after some preprocessing we obtain a new event log to make the same analysis it has done to patients without any predominant clinical condition.

Figure 53: Event log’s process maps with all traces

Figure 54: Event log’s process maps with most frequent traces covering 20%

Figure 55: Percentage of patients’ traces an activity is present

Clustering traces

Traceback (most recent call last):
  File "C:\PROGRA~3\ANACON~1\envs\python39\lib\site-packages\matplotlib\backends\backend_qt.py", line 468, in _draw_idle
    self.draw()
  File "C:\PROGRA~3\ANACON~1\envs\python39\lib\site-packages\matplotlib\backends\backend_agg.py", line 400, in draw
    self.figure.draw(self.renderer)
  File "C:\PROGRA~3\ANACON~1\envs\python39\lib\site-packages\matplotlib\artist.py", line 95, in draw_wrapper
    result = draw(artist, renderer, *args, **kwargs)
  File "C:\PROGRA~3\ANACON~1\envs\python39\lib\site-packages\matplotlib\artist.py", line 72, in draw_wrapper
    return draw(artist, renderer)
  File "C:\PROGRA~3\ANACON~1\envs\python39\lib\site-packages\matplotlib\figure.py", line 3140, in draw
    mimage._draw_list_compositing_images(
  File "C:\PROGRA~3\ANACON~1\envs\python39\lib\site-packages\matplotlib\image.py", line 131, in _draw_list_compositing_images
    a.draw(renderer)
  File "C:\PROGRA~3\ANACON~1\envs\python39\lib\site-packages\matplotlib\artist.py", line 72, in draw_wrapper
    return draw(artist, renderer)
  File "C:\PROGRA~3\ANACON~1\envs\python39\lib\site-packages\matplotlib\axes\_base.py", line 3028, in draw
    self._update_title_position(renderer)
  File "C:\PROGRA~3\ANACON~1\envs\python39\lib\site-packages\matplotlib\axes\_base.py", line 2961, in _update_title_position
    if (ax.xaxis.get_ticks_position() in ['top', 'unknown']
  File "C:\PROGRA~3\ANACON~1\envs\python39\lib\site-packages\matplotlib\axis.py", line 2451, in get_ticks_position
    self._get_ticks_position()]
  File "C:\PROGRA~3\ANACON~1\envs\python39\lib\site-packages\matplotlib\axis.py", line 2155, in _get_ticks_position
    major = self.majorTicks[0]
IndexError: list index out of range

Figure 56: Distance matrix’s dendogram

Figure 57: Distance matrix’s elbow method’s graphic

(a) 5 most frequent traces

(b) Process map covering 25% most frequent traces

Figure 58: Cluster 0

Conformace checking

Figure 59: Traces fitness distribution by cluster

Prediction Models

In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependet Cox model, and the summary of it is:

Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness, 
    data = df)

  n= 2037, number of events= 236 

             coef exp(coef)  se(coef)      z Pr(>|z|)   
age      0.012627  1.012708  0.007743  1.631  0.10293   
sex1    -0.263366  0.768460  0.130839 -2.013  0.04412 * 
fitness  1.950909  7.035079  0.718272  2.716  0.00661 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.0127     0.9875    0.9975    1.0282
sex1       0.7685     1.3013    0.5946    0.9931
fitness    7.0351     0.1421    1.7214   28.7517

Concordance= 0.571  (se = 0.021 )
Likelihood ratio test= 14.09  on 3 df,   p=0.003
Wald test            = 14.18  on 3 df,   p=0.003
Score (logrank) test = 14.23  on 3 df,   p=0.003

Using same predictive variables a joint latent class model of two classes is made:

Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks 
     fitted by maximum likelihood method 
 
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0, 
    subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~ 
        age + sex, hazard = "Weibull", hazardtype = "PH", data = df, 
    verbose = FALSE)
 
Statistical Model: 
     Dataset: df 
     Number of subjects: 474 
     Number of observations: 2037 
     Number of latent classes: 2 
     Number of parameters: 14  
     Event 1: 
        Number of events: 236
        Proportional hazards over latent classes and 
        Weibull baseline risk function 
 
Iteration process: 
     Convergence criteria satisfied 
     Number of iterations:  1 
     Convergence criteria: parameters= 3.3e-17 
                         : likelihood= 5.1e-12 
                         : second derivatives= 5.4e-11 
 
Goodness-of-fit statistics: 
     maximum log-likelihood: 638.59  
     AIC: -1249.18  
     BIC: -1190.93  
     Score test statistic for CI assumption: 7.02 (p-value=0.0299) 
 
Maximum Likelihood Estimates: 
 
Fixed effects in the class-membership model:
(the class of reference is the last class) 

                     coef      Se    Wald p-value
intercept class1  0.72306 0.22974   3.147 0.00165

Parameters in the proportional hazard model:

                             coef      Se    Wald p-value
event1 +/-sqrt(Weibull1)  0.01512 0.00528   2.862 0.00421
event1 +/-sqrt(Weibull2)  1.04871 0.03097  33.858 0.00000
event1 SurvPH class1      1.93430 0.27024   7.158 0.00000
age                       0.01134 0.00837   1.354 0.17564
sex1                     -0.33813 0.14242  -2.374 0.01759

Fixed effects in the longitudinal model:

                     coef      Se    Wald p-value
intercept class1  0.60441 0.00675  89.477 0.00000
intercept class2  0.51196 0.01257  40.722 0.00000
t_0 class1       -0.00072 0.00003 -22.980 0.00000
t_0 class2       -0.00035 0.00003 -12.908 0.00000


Variance-covariance matrix of the random-effects:
          intercept t_0
intercept   0.00533    
t_0         0.00000   0

                            coef      Se
Residual standard error  0.05657 0.00111

 
Posterior classification based on longitudinal and time-to-event data: 
  class1 class2
N  337.0  137.0
%   71.1   28.9
 
Posterior classification table: 
     --> mean of posterior probabilities in each class 
        prob1  prob2
class1 0.8691 0.1309
class2 0.1915 0.8085
 
Posterior probabilities above a threshold (%): 
         class1 class2
prob>0.7  86.35  64.23
prob>0.8  78.04  53.28
prob>0.9  50.45  48.18
 
 
Posterior classification based only on longitudinal data: 
  class1 class2
N 360.00 114.00
%  75.95  24.05
 
png 
  2 

Figure 60: Class-specific weighted marginal and subject-specific mean predicted fitness trajectories

Figure 61: Class-specific mean predicted fitness trajectory

Figure 62: Class-specific survival functions

Obesity

Event log’s creation and description

Choosing patients with obesity and after some preprocessing we obtain a new event log to make the same analysis it has done to patients without any predominant clinical condition.

Figure 63: Event log’s process maps with all traces

Figure 64: Event log’s process maps with most frequent traces covering 20%

Figure 65: Percentage of patients’ traces an activity is present

Clustering traces

Figure 66: Distance matrix’s dendogram

Figure 67: Distance matrix’s elbow method’s graphic

(a) 5 most frequent traces

(b) Process map covering 25% most frequent traces

Figure 68: Cluster 0

Conformace checking

Figure 69: Traces fitness distribution by cluster

Prediction Models

In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependet Cox model, and the summary of it is:

Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness, 
    data = df)

  n= 332, number of events= 32 

            coef exp(coef) se(coef)      z Pr(>|z|)
age     -0.02762   0.97276  0.04598 -0.601    0.548
sex1     0.10049   1.10571  0.35819  0.281    0.779
fitness -0.54395   0.58045  1.75275 -0.310    0.756

        exp(coef) exp(-coef) lower .95 upper .95
age        0.9728     1.0280    0.8889     1.064
sex1       1.1057     0.9044    0.5480     2.231
fitness    0.5805     1.7228    0.0187    18.018

Concordance= 0.568  (se = 0.063 )
Likelihood ratio test= 0.53  on 3 df,   p=0.9
Wald test            = 0.53  on 3 df,   p=0.9
Score (logrank) test = 0.53  on 3 df,   p=0.9

Using same predictive variables a joint latent class model of two classes is made:

Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks 
     fitted by maximum likelihood method 
 
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0, 
    subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~ 
        age + sex, hazard = "Weibull", hazardtype = "PH", data = df, 
    verbose = FALSE)
 
Statistical Model: 
     Dataset: df 
     Number of subjects: 76 
     Number of observations: 332 
     Number of latent classes: 2 
     Number of parameters: 14  
     Event 1: 
        Number of events: 32
        Proportional hazards over latent classes and 
        Weibull baseline risk function 
 
Iteration process: 
     Convergence criteria satisfied 
     Number of iterations:  77 
     Convergence criteria: parameters= 9.4e-05 
                         : likelihood= 2.4e-05 
                         : second derivatives= 3.2e-05 
 
Goodness-of-fit statistics: 
     maximum log-likelihood: 170.22  
     AIC: -312.44  
     BIC: -279.81  
     Score test statistic for CI assumption: 0.692 (p-value=0.7076) 
 
Maximum Likelihood Estimates: 
 
Fixed effects in the class-membership model:
(the class of reference is the last class) 

                     coef       Se    Wald p-value
intercept class1  1.09973  0.36714   2.995 0.00274

Parameters in the proportional hazard model:

                             coef       Se    Wald p-value
event1 +/-sqrt(Weibull1) -0.00224  0.00902  -0.248 0.80426
event1 +/-sqrt(Weibull2)  1.19674  0.08955  13.364 0.00000
event1 SurvPH class1      9.29255 11.68361   0.795 0.42641
age                      -0.00811  0.04940  -0.164 0.86960
sex1                      0.22918  0.38309   0.598 0.54968

Fixed effects in the longitudinal model:

                     coef       Se    Wald p-value
intercept class1  0.58606  0.01352  43.339 0.00000
intercept class2  0.52208  0.02528  20.654 0.00000
t_0 class1       -0.00070  0.00005 -14.259 0.00000
t_0 class2       -0.00032  0.00006  -5.442 0.00000


Variance-covariance matrix of the random-effects:
          intercept t_0
intercept   0.00745    
t_0         0.00000   0

                            coef       Se
Residual standard error  0.04868  0.00254

 
Posterior classification based on longitudinal and time-to-event data: 
  class1 class2
N  63.00  13.00
%  82.89  17.11
 
Posterior classification table: 
     --> mean of posterior probabilities in each class 
        prob1  prob2
class1 0.8875 0.1125
class2 0.0849 0.9151
 
Posterior probabilities above a threshold (%): 
         class1 class2
prob>0.7  82.54  92.31
prob>0.8  73.02  84.62
prob>0.9  57.14  76.92
 
 
Posterior classification based only on longitudinal data: 
  class1 class2
N  65.00  11.00
%  85.53  14.47
 
png 
  2 

Figure 70: Class-specific weighted marginal and subject-specific mean predicted fitness trajectories

Figure 71: Class-specific mean predicted fitness trajectory

Figure 72: Class-specific survival functions

Process indicators’ analysis

We selected incident patients with at least one year of follow-up to create a process indicator’s event log. This log includes cholesterol, albumin-creatinine index, glycated hemoglobin, body mass index, blood pressure and glomerular filtration measures, and foot and eye examinations.

An activity presence analysis is done to see the percentage of presence of each indicator in analyzed traces.

Figure 73: Percentage of patients’ traces an activity is present during the first year

Figure 74: Percentage of patients’ traces an activity is present during the second year

Figure 75: Percentage of patients’ traces an activity is present during the third year

Figure 76: Percentage of patients’ traces an activity is present during the fourth year

Figure 77: Percentage of patients’ traces an activity is present during the fifth year

Conformace checking

As has been done before, in this section, the observed traces are compared with a specific theoretical process. However, instead of utilizing a single Petri net, five Petri nets are employed, one for each year, as the adherence is now considered in annual intervals from diabetes detection date.

Boxplot of process indicators’ traces’ fitness by cumulative years)

Treatments’ analysis’ and process indicators’ fitness are linked with the next plot:

Boxplot of process indicators’ traces’ fitness by cumulative year and by predominant clinical condition)

Prediction Models

In the next example, we choose fitness, age, sex and copayment to try to predict a hospitalization with a time dependent Cox model, and the summary of it is:

Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness, 
    data = df)

  n= 721, number of events= 217 

            coef exp(coef) se(coef)      z Pr(>|z|)  
age      0.00315   1.00316  0.00527  0.598   0.5500  
sex1    -0.32424   0.72308  0.13661 -2.374   0.0176 *
fitness -0.56207   0.57003  0.38808 -1.448   0.1475  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.0032     0.9969    0.9928    1.0136
sex1       0.7231     1.3830    0.5532    0.9451
fitness    0.5700     1.7543    0.2664    1.2196

Concordance= 0.558  (se = 0.021 )
Likelihood ratio test= 7.59  on 3 df,   p=0.06
Wald test            = 7.56  on 3 df,   p=0.06
Score (logrank) test = 7.6  on 3 df,   p=0.06

Using same predictive variables a joint latent class model of two classes is made:

Be patient, grid search is running ...
Search completed, performing final estimation
Joint latent class model for quantitative outcome and competing risks 
     fitted by maximum likelihood method 
 
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~t_0, 
    subject = "ID2", ng = 2, survival = Surv(t_min, t_max, status2) ~ 
        age + sex, hazard = "Weibull", hazardtype = "PH", data = df, 
    verbose = FALSE)
 
Statistical Model: 
     Dataset: df 
     Number of subjects: 458 
     Number of observations: 721 
     Number of latent classes: 2 
     Number of parameters: 14  
     Event 1: 
        Number of events: 217
        Proportional hazards over latent classes and 
        Weibull baseline risk function 
 
Iteration process: 
     Convergence criteria satisfied 
     Number of iterations:  1 
     Convergence criteria: parameters= 3.9e-19 
                         : likelihood= 1.8e-12 
                         : second derivatives= 2.7e-13 
 
Goodness-of-fit statistics: 
     maximum log-likelihood: -1241.45  
     AIC: 2510.9  
     BIC: 2568.68  
     Score test statistic for CI assumption: 0.325 (p-value=0.8501) 
 
Maximum Likelihood Estimates: 
 
Fixed effects in the class-membership model:
(the class of reference is the last class) 

                     coef      Se   Wald p-value
intercept class1 -0.81695 0.27704 -2.949 0.00319

Parameters in the proportional hazard model:

                             coef      Se   Wald p-value
event1 +/-sqrt(Weibull1)  0.02877 0.00695  4.138 0.00003
event1 +/-sqrt(Weibull2)  0.95979 0.02716 35.343 0.00000
event1 SurvPH class1      0.51304 0.25240  2.033 0.04209
age                       0.00523 0.00534  0.979 0.32736
sex1                     -0.32626 0.13780 -2.368 0.01790

Fixed effects in the longitudinal model:

                     coef      Se   Wald p-value
intercept class1  0.21100 0.02977  7.087 0.00000
intercept class2  0.42396 0.01331 31.862 0.00000
t_0 class1        0.00034 0.00004  9.316 0.00000
t_0 class2       -0.00001 0.00001 -0.652 0.51462


Variance-covariance matrix of the random-effects:
          intercept t_0
intercept    0.0173    
t_0          0.0000   0

                            coef      Se
Residual standard error  0.06460 0.00399

 
Posterior classification based on longitudinal and time-to-event data: 
  class1 class2
N 123.00 335.00
%  26.86  73.14
 
Posterior classification table: 
     --> mean of posterior probabilities in each class 
        prob1  prob2
class1 0.7494 0.2506
class2 0.1438 0.8562
 
Posterior probabilities above a threshold (%): 
         class1 class2
prob>0.7  60.16  86.57
prob>0.8  34.96  71.94
prob>0.9  28.46  51.94
 
 
Posterior classification based only on longitudinal data: 
  class1 class2
N 102.00 356.00
%  22.27  77.73
 
png 
  2 

Figure 78: Class-specific weighted marginal and subject-specific mean predicted fitness trajectories

Figure 79: Class-specific mean predicted fitness trajectory

Figure 80: Class-specific survival functions

Joint latent class models’ summary

(a) ‘else’ patients’ class-specific weighted marginal and subject-specific mean predicted treatments’ fitness trajectories

(b) ‘else’ patients’ class-specific mean predicted treatments’ fitness trajectory

(c) ‘else’ patients’ class-specific survival functions

(d) cvd patients’ class-specific weighted marginal and subject-specific mean predicted treatments’ fitness trajectories

(e) cvd patients’ class-specific mean predicted treatments’ fitness trajectory

(f) cvd patients’ class-specific survival functions

(g) hf patients’ class-specific weighted marginal and subject-specific mean predicted treatments’ fitness trajectories

(h) hf patients’ class-specific fitness mean predicted treatments’ fitness trajectory

(i) hf patients’ class-specific survival functions

(j) ckd patients’ class-specific weighted marginal and subject-specific mean predicted treatments’ fitness trajectories

(k) ckd patients’ class-specific mean predicted treatments’ fitness trajectory

(l) ckd patients’ class-specific survival functions

(m) Fragile patients’ class-specific weighted marginal and subject-specific mean predicted treatments’ fitness trajectories

(n) Fragile patients’ class-specific mean predicted treatments’ fitness trajectory

(o) Fragile patients’ class-specific survival functions

(p) Obese patients’ class-specific weighted marginal and subject-specific mean predicted treatments’ fitness trajectories

(q) Obese patients’ class-specific mean predicted treatments’ fitness trajectory

(r) Obese patients’ class-specific survival functions

(s) Class-specific weighted marginal and subject-specific mean predicted process indicators’ fitness trajectories

(t) Class-specific mean predicted process indicators’ fitness trajectory

(u) Class-specific survival functions

Figure 81: Joint latent class models’ plots’ panel